storage: implement intent resolution using LockTableIterator #110324
Conversation
Force-pushed from 2bb94d4 to a111678.
-- commits
line 146 at r6:
Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?
Previously, sumeerbhola wrote…
Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?
I can't remember whether this optimization was prompted only by microbenchmarks. The case described in the second bullet of the commit message was not optimized even before separated intents. Perhaps I was being extra sensitive to intent resolution performance at the time, given that intent resolution latency is important for lowering contention.
Another possibility would be to do a series of SeekGEWithLimit calls, one for each of the strengths. We don't fully optimize this monotonic-seek case in pebble.Iterator, but it would not be hard to change that to use next calls under the hood. We would still need to produce a limit key, which is a bit problematic since it needs to be a valid EngineKey for cmp to work and needs to be exclusive. If we can efficiently construct the immediately smaller txn ID, then we can produce a limit key.
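Roughly, the shape would be something like the following sketch (illustrative only; the limit-key construction is left as a comment because building that "immediately smaller txn ID" key is exactly the open question above):

```go
package storagesketch // illustrative only

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// seekPerStrength sketches resolving a single key with one seek per lock
// strength, instead of scanning every lock-table version on the key.
func seekPerStrength(iter storage.EngineIterator, key roachpb.Key, txnID uuid.UUID) error {
	// Lock-table versions on a key sort as Intent, Exclusive, Shared.
	for _, str := range []lock.Strength{lock.Intent, lock.Exclusive, lock.Shared} {
		ltKey := storage.LockTableKey{Key: key, Strength: str, TxnUUID: txnID}
		seekKey, _ := ltKey.ToEngineKey(nil)
		// Ideally this would be a seek-with-limit whose exclusive limit sits
		// just below (key, str, txnID), so a missing lock doesn't pay for a
		// scan over unrelated tombstones; constructing that limit key is the
		// hard part discussed above.
		valid, err := iter.SeekEngineKeyGE(seekKey)
		if err != nil || !valid {
			return err
		}
		// ... if the iterator landed exactly on (key, str, txnID), resolve it.
	}
	return nil
}
```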
Force-pushed from 7efccf0 to 2267f58.
Right, we don't currently track which locks with which strengths are held on which keys, only that there are locks with some strength held on certain keys. Tracking the strength of locks in the coordinator and then storing locks with their strength in the txn record would let us perform more precise resolution.
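If we did track this, the coordinator-side bookkeeping might look something like the sketch below (purely hypothetical; no such structure exists today):

```go
package lockspans // hypothetical

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spansByStrength is a hypothetical per-strength record of the spans a txn
// acquired locks on, which could be carried in the txn record / LockUpdate.
type spansByStrength map[lock.Strength][]roachpb.Span

// strengthsHeldOn returns the strengths that may be held within span, letting
// resolution seek only to those (key, strength, txn_id) versions.
func strengthsHeldOn(m spansByStrength, span roachpb.Span) []lock.Strength {
	var out []lock.Strength
	for str, held := range m {
		for _, s := range held {
			if s.Overlaps(span) {
				out = append(out, str)
				break
			}
		}
	}
	return out
}
```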
From what I was able to dig up, it sounds like one of the SQL logic tests that performed ranged intent resolution became significantly slower with separated intents. I don't see any mention of single-key resolution being so slow that we had to optimize it. It makes me wonder whether the first bullet in the commit message was an opportunistic improvement. I'm running the benchmarks introduced in that commit over this PR to get a better sense for the impact.
I've been considering this. It would improve the worst-case performance of single-key resolution in exchange for a more expensive common case (3 seeks instead of 1), and I wonder whether that's worth it. We already scan over all locks on a key (including tombstones) when writing these intents, so the degenerate case where a single key is written and resolved in quick succession is quadratic no matter what we do here.
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
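For context, the heuristic itself is the familiar step-a-few-times-then-seek pattern. A simplified, self-contained sketch (not the actual LockTableIterator code; the constant, interface, and function names here are invented):

```go
package ltscan // illustrative only

import "bytes"

// itersBeforeSeek caps how many ignored locks we step over on a single user
// key before giving up and seeking straight past the key.
const itersBeforeSeek = 5

// lockIter is a stand-in for the real lock-table iterator.
type lockIter interface {
	Valid() bool
	Next()
	SeekGE(key []byte)
	UserKey() []byte // user key portion of the current lock-table key
	Ignored() bool   // true if the current lock is filtered out by config
}

// advancePastIgnored steps over ignored locks on curKey, but only up to
// itersBeforeSeek times; after that it seeks past the key entirely, so a key
// with many ignored shared locks costs one seek instead of O(locks) steps.
func advancePastIgnored(it lockIter, curKey []byte) {
	for i := 0; i < itersBeforeSeek; i++ {
		it.Next()
		if !it.Valid() || !bytes.Equal(it.UserKey(), curKey) || !it.Ignored() {
			return // new key, exhaustion, or a lock we actually want
		}
	}
	// Too many ignored locks on this key: seek to the first key after it.
	it.SeekGE(append(append([]byte(nil), curKey...), 0))
}
```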
I'm assuming we want to bottom out on the tombstones discussion before we merge -- I don't have too much to add, but I'll follow along.
found and the iterator advances past them while seeking. In such cases,
we're back to paying the cost of scanning over the tombstones.
I don't fully follow this bit -- could you ELI5 for me?
pkg/storage/mvcc.go
line 4574 at r8 (raw file):
// mvccResolveWriteIntent is the core logic for resolving an intent. The
// function accepts the instructions for how to resolve the intent (encoded in
nit: s/the instructions/instructions?
pkg/storage/mvcc.go
line 4578 at r8 (raw file):
// Returns whether the provided intent was resolved (true) or whether the
// resolution was a no-op (false).
// REQUIRES: iter surfaces range keys via IterKeyTypePointsAndRanges.
nit: while we're here, could you add a new line above REQUIRES please?
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
// We could also compute a tighter nextKey here if we wanted to.
// TODO(nvanbenschoten): this resumeSpan won't be correct if there
// are multiple locks on the same key. What if only of the locks for
nit: "only one"
Separately, could we include the strength when constructing lastResolvedKey
as well?
pkg/storage/mvcc.go
line 5134 at r8 (raw file):
return numKeys, numBytes, &roachpb.Span{Key: lastResolvedKey.Next(), EndKey: intentEndKey}, resumeReason, nil
}
ltEngineKey, err := ltIter.EngineKey()
[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?
pkg/storage/mvcc.go
line 5156 at r8 (raw file):
beforeBytes := rw.BufferedSize()
var ok bool
if ltKey.Strength == lock.Intent {
Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.
Previously, arulajmani (Arul Ajmani) wrote…
found and the iterator advances past them while seeking. In such cases, we're back to paying the cost of scanning over the tombstones.
I don't fully follow this bit -- could you ELI5 for me?
If the exact key we are looking for is not there, the internal Pebble DELs that have not been compacted away have to be iterated over. See the first bullet in the commit message for d1c91e0 -- that optimization works when the key is actually there.
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "only one"
Separately, could we include the strength when constructing
lastResolvedKey
as well?
The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.
pkg/storage/mvcc.go
line 5134 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?
We also need ltKey.Key below, which we don't know up front since we are resolving a whole range.
Force-pushed from 2267f58 to a1ba790.
Pebble exposes a SeekGEWithLimit method, but this "limit" value is expressed as a key and not as a number of steps.
The problem that -WithLimit solves can't be expressed as a number of steps, right? My understanding is that it specifically helps in the case where the sought key doesn't exist, so the seek may need to next through an unbounded number of tombstones to arrive at the next live key.
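For reference, here is roughly how that contract can be used (sketch only; semantics paraphrased as I understand them, not authoritative):

```go
package pebblesketch // illustrative only

import "github.com/cockroachdb/pebble"

// seekBounded sketches how a limit key bounds the work of a seek: if nothing
// live exists in [seekKey, limitKey), pebble can stop at the limit rather
// than nexting through tombstones all the way to the next live key.
func seekBounded(it *pebble.Iterator, seekKey, limitKey []byte) bool {
	switch it.SeekGEWithLimit(seekKey, limitKey) {
	case pebble.IterValid:
		return true // positioned on a live key before limitKey
	case pebble.IterAtLimit:
		return false // nothing live before limitKey; stopped at the boundary
	default: // pebble.IterExhausted
		return false // no keys at or after seekKey
	}
}
```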
The related problem here is that NextWithLimit doesn't have a way to bound the amount of work performed before bailing. That would be useful for the kind of step-a-few-times-then-seek heuristics we have in a few places (including #110754), which balance the best-case cost of Next with the worst-case cost of Seek. If these heuristics are built above the pebble iterator, they don't have a way to limit the number of tombstones they encounter.
Have we considered pushing such a heuristic into Pebble? Something like an iter.SeekAfterNextWithLimit(key, limit).
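Strawman for what that could look like (entirely hypothetical; no such method exists in pebble today):

```go
package pebblestrawman // hypothetical only

import "github.com/cockroachdb/pebble"

// seekAfterNexter is a strawman for the API floated above; pebble has no such
// method today. The idea: try to reach key by stepping internally, but give
// up and do a real seek once more than limit internal keys (live or
// tombstoned) have been stepped over, so callers above the iterator can bound
// the tombstone cost they otherwise cannot see.
type seekAfterNexter interface {
	SeekAfterNextWithLimit(key []byte, limit int) pebble.IterValidityState
}
```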
pkg/storage/mvcc.go
line 4574 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/the instructions/instructions?
Done.
pkg/storage/mvcc.go
line 4578 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: while we're here, could you add a new line above REQUIRES please?
Done.
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
Previously, sumeerbhola wrote…
The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.
Yeah, I think it's cleanest to just keep this key-oriented. But it's a TODO I need to address in #110480 before merging that change.
pkg/storage/mvcc.go
line 5156 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.
Yep, see #110480 😃
Force-pushed from a1ba790 to c01eeee.
Have we considered pushing such a heuristic into Pebble? Something like an iter.SeekAfterNextWithLimit(key, limit).
SeekGEWithLimit will transparently try to next internally for repeated seeks with monotonically increasing keys, but it has limitations. See:
https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/iterator.go#L1207-L1249 https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/sstable/reader_iter_single_lvl.go#L669-L696
I think this is what motivated @sumeerbhola's three-seek suggestion, but I'm not sure if any of the TrySeekUsingNext limitations apply here.
Informs cockroachdb#109648.

This commit implements intent resolution (point and ranged) using a `LockTableIterator`, configured to return all locks for the transaction being resolved and no locks from other transactions. This is the first step towards releasing replicated locks during intent resolution.

While switching to a LockTableIterator, the commit is also able to remove separatedIntentAndVersionIter, iterForKeyVersions, and mvccGetIntent, which were all used to avoid handing an MVCCMetadata directly to mvccResolveWriteIntent. Instead of continuing to treat intents as interleaved, we switch to handling intents entirely separately from their provisional value during intent resolution, which avoids jumping through these hoops and makes the code simpler.

The change to `TestMVCCResolveTxnRangeResumeWithManyVersions` is immaterial and has to do with the transaction ID filter being applied before the key limit (inside LockTableIterator), instead of after. The new behavior is actually better.

----

One concern I have about this change is that it removes the call to `SeekIntentGE` in `MVCCResolveWriteIntent`, which was added in d1c91e0 to guard against the case where many pebble tombstones from prior intents from different txns on a key surround the intent being resolved. Conceptually, we'd like to push optimizations that avoid scanning over these tombstones into the `LockTableIterator`, like we plan to do for skipping over non-conflicting locks. Doing so would benefit all lock strengths. It would also benefit the case where an intent is not found and the seek hits tombstones from prior intents on later versions.

However, it's not clear how to do this with the current Pebble API. Pebble exposes a `SeekGEWithLimit` method, but this "limit" value is expressed as a key and not as a number of steps. How would we construct a limit key to bound the number of tombstones a seek observes before seeking directly to a specific (txn_id, lock_strength) version?

One option would be to seek to specific versions in the `LockTableIterator` when advancing the iterator in cases where the iterator is configured to match a specific txn ID. For example, performing the following translations:

```
SeekGE({Key: k}) -> SeekGE({Key: k, Strength: Intent, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Exclusive, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Shared, TxnID: <txn_id>})
```

Of course, this gets more complicated when some of these locks are not found and the iterator advances past them while seeking. In such cases, we're back to paying the cost of scanning over the tombstones. If we knew which lock strengths we had acquired on a key, we could avoid some of this cost, but that would require API changes and client buy-in to track lock spans on a per-strength basis.

I'll capture the impact of this change on the following benchmarks and evaluate:

* BenchmarkIntentResolution
* BenchmarkIntentRangeResolution
* BenchmarkIntentScan

Release note: None
Now that the specialized method is no longer used in `MVCCResolveWriteIntent`, we can delete it.

Epic: None
Release note: None
Force-pushed from c01eeee to 4e46443.
Here is the impact that this change has on the benchmarks added in d1c91e0. We see that it's non-impactful except for the most degenerate cases, as expected:
Given that these cases impact the front half of mutations regardless of what we do during intent resolution, I'm ok with the limited regressions on this microbenchmark. Still, to minimize the impact, I re-introduced half of the previous optimization into
TFTRs!
bors r+
Build succeeded:
For completeness, here's the impact of this change on