
storage: implement intent resolution using LockTableIterator #110324

Merged

Conversation


@nvanbenschoten nvanbenschoten commented Sep 11, 2023

Informs #109648.

This commit implements intent resolution (point and ranged) using a LockTableIterator, configured to return all locks for the transaction being resolved and no locks from other transactions. This is the first step towards releasing replicated locks during intent resolution.

While switching to a LockTableIterator, the commit is also able to remove separatedIntentAndVersionIter, iterForKeyVersions and mvccGetIntent, which were all used to avoid handing an MVCCMetadata directly to mvccResolveWriteIntent. Instead of continuing to treat intents as interleaved, we switch to handling intents entirely separately from their provisional value during intent resolution, which avoids jumping through these hoops and makes the code simpler.

The change to TestMVCCResolveTxnRangeResumeWithManyVersions is immaterial and has to do with the transaction ID filter being applied before the key limit (inside LockTableIterator), instead of after. The new behavior is actually better.


One concern I have about this change is that it removes the call to SeekIntentGE in MVCCResolveWriteIntent, which was added in d1c91e0 to guard against the case where many Pebble tombstones, left by prior intents from different txns on a key, surround the intent being resolved. Conceptually, we'd like to push optimizations that avoid scanning over these tombstones into the LockTableIterator, like we plan to do for skipping over non-conflicting locks. Doing so would benefit all lock strengths. It would also benefit the case where an intent is not found and the seek hits tombstones from prior intents on later versions.

However, it's not clear how to do this with the current Pebble API. Pebble exposes a SeekGEWithLimit method, but this "limit" value is expressed as a key and not as a number of steps. How would we construct a limit key to bound the number of tombstones a seek observes before seeking directly to a specific (txn_id, lock_strength) version?

One option would be to seek to specific versions in the LockTableIterator when advancing the iterator in cases where the iterator is configured to match a specific txn ID. For example, performing the following translations:

SeekGE({Key: k}) -> SeekGE({Key: k, Strength: Intent, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Exclusive, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Shared, TxnID: <txn_id>})

Of course, this gets more complicated when some of these locks are not found and the iterator advances past them while seeking. In such cases, we're back to paying the cost of scanning over the tombstones.

If we knew which lock strengths we had acquired on a key, we could avoid some of this cost, but that would require API changes and client buy-in to track lock spans on a per-strength basis.
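The seek translation above can be sketched as follows. This is a hypothetical illustration, not CockroachDB's real LockTableIterator code: the Strength ordering (Intent, then Exclusive, then Shared) simply mirrors the order used in the translation table, and the `version` type and `nextSeekTarget` helper are invented for the sketch.

```go
package main

import "fmt"

// Strength orders lock strengths the way the translation table above
// iterates them (assumed encoding order for this sketch).
type Strength int

const (
	Intent Strength = iota
	Exclusive
	Shared
)

func (s Strength) String() string {
	return [...]string{"Intent", "Exclusive", "Shared"}[s]
}

// version identifies one lock-table version: user key, strength, txn ID.
type version struct {
	Key   string
	Str   Strength
	TxnID string
}

// nextSeekTarget returns the exact version a txn-filtered iterator would
// seek to instead of calling Next. After Shared (the last strength) the
// txn can hold nothing else on this key, so ok=false means "move on to
// the next user key".
func nextSeekTarget(cur version, txnID string) (next version, ok bool) {
	if cur.Str == Shared {
		return version{}, false
	}
	return version{Key: cur.Key, Str: cur.Str + 1, TxnID: txnID}, true
}

func main() {
	cur := version{Key: "k", Str: Intent, TxnID: "txn1"}
	for {
		fmt.Printf("SeekGE to %s/%s/%s\n", cur.Key, cur.Str, cur.TxnID)
		next, ok := nextSeekTarget(cur, "txn1")
		if !ok {
			break
		}
		cur = next // in the real iterator: SeekGE(next)
	}
}
```

Each Next becomes a targeted seek, so the iterator never steps through other txns' versions (or their tombstones) between the strengths it cares about, at the cost of paying for a seek even when a Next would have sufficed.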

I'll capture the impact of this change on the following benchmarks and evaluate:

  • BenchmarkIntentResolution
  • BenchmarkIntentRangeResolution
  • BenchmarkIntentScan

Release note: None


@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/releaseReplLocks branch from 2bb94d4 to a111678 Compare September 11, 2023 14:13
@sumeerbhola (Collaborator) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 146 at r6:
Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?

@sumeerbhola (Collaborator) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 146 at r6:

Previously, sumeerbhola wrote…

Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?

I can't remember whether this optimization was prompted only by microbenchmarks. The case described in the second bullet of the commit comment was not optimized even before separated intents. Perhaps I was being extra sensitive to intent resolution performance at that time, given that intent resolution latency is important for lowering contention.

Another possibility would be to do a series of multiple SeekGEWithLimit calls for each of the strengths. We don't fully optimize this monotonic seek case in pebble.Iterator but it is not hard to change that to use next calls under the hood. We would still need to produce a limit key, which is a bit problematic since it needs to be a valid EngineKey for cmp to work and needs to be exclusive. If we can efficiently construct the immediately smaller txnid, then we can produce a limit key.
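The "immediately smaller txnid" construction mentioned above can be sketched by treating the 16-byte txn UUID as a big-endian 128-bit integer and subtracting one; the predecessor can then stand in as an inclusive bound where an exclusive one is needed. This is an illustration only (the `prevTxnID` helper is invented), not CockroachDB's key-encoding code.

```go
package main

import "fmt"

// prevTxnID returns the 16-byte value immediately smaller than id, treating
// id as a big-endian unsigned integer. ok=false means id was all zeros and
// has no predecessor.
func prevTxnID(id [16]byte) ([16]byte, bool) {
	// Propagate the borrow from the least significant byte upward.
	for i := 15; i >= 0; i-- {
		if id[i] > 0 {
			id[i]--
			return id, true
		}
		id[i] = 0xff // 0x00 borrows to 0xff; continue with the next byte
	}
	return [16]byte{}, false // the all-zero UUID has no predecessor
}

func main() {
	var id [16]byte
	id[14] = 0x01 // ends in ...01 00
	prev, ok := prevTxnID(id)
	fmt.Printf("%v %x %x\n", ok, prev[14], prev[15]) // true 0 ff
}
```

The remaining work would be embedding this predecessor into a valid EngineKey so that key comparison treats it as the exclusive limit, which is the problematic part noted above.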

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/releaseReplLocks branch 2 times, most recently from 7efccf0 to 2267f58 Compare September 15, 2023 16:18
@nvanbenschoten (Member, Author)

Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?

Right, we don't currently track which locks with which strengths are held on which keys, only that there are locks with some strength held on certain keys. Tracking the strength of locks in the coordinator and then storing locks with their strength in the txn record would let us perform more precise resolution.

I can't remember whether this optimization was prompted by only microbenchmarks.

From what I was able to dig up, it sounds like one of the SQL logic tests that performed ranged intent resolution became significantly slower with separated intents. I don't see any mention of single-key resolution being so slow that we had to optimize it. It makes me wonder whether the first bullet in the commit message was an opportunistic improvement.

I'm running the benchmarks introduced in that commit over this PR to get a better sense for the impact.

Another possibility would be to do a series of multiple SeekGEWithLimit calls for each of the strengths. We don't fully optimize this monotonic seek case in pebble.Iterator but it is not hard to change that to use next calls under the hood. We would still need to produce a limit key, which is a bit problematic since it needs to be a valid EngineKey for cmp to work and needs to be exclusive. If we can efficiently construct the immediately smaller txnid, then we can produce a limit key.

I've been considering this. It would improve the worst-case performance of single-key resolution in exchange for a more expensive common case (3 seeks instead of 1). I wonder if that's worth it. We already scan over all locks on a key (including tombstones) when writing these intents, so the degenerate case where a single key is written and resolved in quick succession is quadratic no matter what we do here.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 15, 2023
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter
before seek" optimization in the LockTableIterator, similar to the one
that exists in the pebbleMVCCScanner. The optimization places an upper
bound on the number of iterations that a LockTableIterator that is
configured to ignore some or all shared locks will perform across the
shared locks on a single user key before seeking past them. This is used
to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks)
work in the LockTableIterator, instead performing at most
O(matching_locks + locked_keys) work. This is important for iteration
over the lock table (e.g. intentInterleavingIter), lock acquisition
(MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is
a caveat to these complexity bounds, however, in that they do not
consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
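The "iter before seek" optimization this commit message describes can be sketched as follows. The constants and iterator shape are illustrative, not the real LockTableIterator: the iterator Nexts over up to a small budget of ignored shared locks on one user key, then seeks straight past the rest, bounding per-key work at O(min(ignored_locks, budget)).

```go
package main

import "fmt"

// itersBeforeSeek is the budget of Next calls spent on ignored locks for a
// single user key before falling back to a seek (value is illustrative).
const itersBeforeSeek = 5

// entry is a simplified lock-table entry: a user key plus whether it is a
// shared lock that this iterator is configured to ignore.
type entry struct {
	key    string
	shared bool
}

// skipIgnored returns the index of the first non-ignored entry at or after
// i. It uses Next (linear steps) while under budget, then "seeks" (here: a
// jump past the remaining ignored locks on the same key) once the budget
// is spent. seeked reports whether the fallback seek was needed.
func skipIgnored(entries []entry, i int) (pos int, seeked bool) {
	steps := 0
	for i < len(entries) && entries[i].shared {
		if steps >= itersBeforeSeek {
			// Seek past the remaining ignored shared locks on this key.
			key := entries[i].key
			for i < len(entries) && entries[i].key == key && entries[i].shared {
				i++
			}
			return i, true
		}
		i++
		steps++
	}
	return i, false
}

func main() {
	var entries []entry
	for j := 0; j < 100; j++ {
		entries = append(entries, entry{key: "a", shared: true})
	}
	entries = append(entries, entry{key: "b"})
	i, seeked := skipIgnored(entries, 0)
	fmt.Println(i, seeked) // 100 true: one seek replaces ~95 Nexts
}
```

As the commit message's caveat notes, this bounds the number of *visible* ignored locks stepped over, but the fallback seek can still pay for LSM tombstones internally.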
@arulajmani (Collaborator) left a comment


:lgtm:

I'm assuming we want to bottom out on the tombstones discussion before we merge -- I don't have too much to add, but I'll follow along.

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, 2 of 2 files at r4, 7 of 7 files at r5, 1 of 5 files at r6, 28 of 28 files at r8, 25 of 25 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


-- commits line 52 at r9:

found and the iterator advances past them while seeking. In such cases,
we're back to paying the cost of scanning over the tombstones.

I don't fully follow this bit -- could you ELI5 for me?


pkg/storage/mvcc.go line 4574 at r8 (raw file):

// mvccResolveWriteIntent is the core logic for resolving an intent. The
// function accepts the instructions for how to resolve the intent (encoded in

nit: s/the instructions/instructions?


pkg/storage/mvcc.go line 4578 at r8 (raw file):

// Returns whether the provided intent was resolved (true) or whether the
// resolution was a no-op (false).
// REQUIRES: iter surfaces range keys via IterKeyTypePointsAndRanges.

nit: while we're here, could you add a new line above REQUIRES please?


pkg/storage/mvcc.go line 5129 at r8 (raw file):

			// We could also compute a tighter nextKey here if we wanted to.
			// TODO(nvanbenschoten): this resumeSpan won't be correct if there
			// are multiple locks on the same key. What if only of the locks for

nit: "only one"

Separately, could we include the strength when constructing lastResolvedKey as well?


pkg/storage/mvcc.go line 5134 at r8 (raw file):

			return numKeys, numBytes, &roachpb.Span{Key: lastResolvedKey.Next(), EndKey: intentEndKey}, resumeReason, nil
		}
		ltEngineKey, err := ltIter.EngineKey()

[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?


pkg/storage/mvcc.go line 5156 at r8 (raw file):

		beforeBytes := rw.BufferedSize()
		var ok bool
		if ltKey.Strength == lock.Intent {

Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.

@sumeerbhola (Collaborator) left a comment


:lgtm:

Reviewed 4 of 28 files at r8, 10 of 25 files at r9, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 52 at r9:

Previously, arulajmani (Arul Ajmani) wrote…

found and the iterator advances past them while seeking. In such cases,
we're back to paying the cost of scanning over the tombstones.

I don't fully follow this bit -- could you ELI5 for me?

if the exact key we are looking for is not there, the internal Pebble DELs that have not been compacted away have to be iterated over. See the first bullet in the commit message for d1c91e0 -- that optimization works when the key is actually there.


pkg/storage/mvcc.go line 5129 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: "only one"

Separately, could we include the strength when constructing lastResolvedKey as well?

The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.


pkg/storage/mvcc.go line 5134 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?

We also need ltKey.Key below which we don't know up front since we are resolving a whole range.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/releaseReplLocks branch from 2267f58 to a1ba790 Compare September 21, 2023 16:21
@jbowens (Collaborator) left a comment


Pebble exposes a SeekGEWithLimit method, but this "limit" value is expressed as a key and not as a number of steps.

The problem that -WithLimit solves can't be expressed as a number of steps, right? My understanding is that it specifically helps in the case where the sought key doesn't exist, so the seek may need to next through an unbounded number of tombstones to arrive at the next live key.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani, @nvanbenschoten, and @sumeerbhola)

@nvanbenschoten (Member, Author) left a comment


The related problem that's relevant here is that NextWithLimit doesn't have a way to bound the amount of work performed before bailing. That would be useful for the kind of step-a-few-times-then-seek heuristics we have in a few places (including #110754), which balance the best-case cost of Next with the worst-case cost of Seek. If these heuristics are built above the pebble iterator, they don't have a way to limit the number of tombstones they encounter.

Have we considered pushing such a heuristic into Pebble? Something like an iter.SeekAfterNextWithLimit(key, limit).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani and @sumeerbhola)


pkg/storage/mvcc.go line 4574 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: s/the instructions/instructions?

Done.


pkg/storage/mvcc.go line 4578 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: while we're here, could you add a new line above REQUIRES please?

Done.


pkg/storage/mvcc.go line 5129 at r8 (raw file):

Previously, sumeerbhola wrote…

The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.

Yeah, I think it's cleanest to just keep this key-oriented. But it's a TODO I need to address in #110480 before merging that change.


pkg/storage/mvcc.go line 5156 at r8 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.

Yep, see #110480 😃

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/releaseReplLocks branch from a1ba790 to c01eeee Compare September 21, 2023 18:34
@jbowens (Collaborator) left a comment


Have we considered pushing such a heuristic into Pebble? Something like a iter.SeekAfterNextWithLimit(key, limit).

SeekGEWithLimit will transparently try to next internally for repeated seeks with monotonically increasing keys, but it has limitations. See:
https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/iterator.go#L1207-L1249 https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/sstable/reader_iter_single_lvl.go#L669-L696

I think this is what motivated @sumeerbhola's three-seek suggestion, but I'm not sure if any of the TrySeekUsingNext limitations apply here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @arulajmani and @sumeerbhola)

Now that the specialized method is no longer used in
`MVCCResolveWriteIntent`, we can delete it.

Epic: None
Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/releaseReplLocks branch from c01eeee to 4e46443 Compare September 23, 2023 17:03
@nvanbenschoten (Member, Author)

I'll capture the impact of this change on the following benchmarks and evaluate

Here is the impact that this change has on the benchmarks added in d1c91e0. We see that it has little impact except in the most degenerate cases, as expected:

name                                                  old time/op  new time/op   delta
IntentScan/versions=10/percent-flushed=0-30           2.84µs ± 0%   2.82µs ± 0%    -0.78%  (p=0.000 n=10+9)
IntentScan/versions=10/percent-flushed=50-30          1.61µs ± 0%   1.60µs ± 0%    -1.19%  (p=0.000 n=10+10)
IntentScan/versions=10/percent-flushed=80-30          1.14µs ± 0%   1.13µs ± 0%    -1.21%  (p=0.000 n=10+9)
IntentScan/versions=10/percent-flushed=90-30           943ns ± 0%    934ns ± 0%    -0.95%  (p=0.000 n=10+10)
IntentScan/versions=10/percent-flushed=100-30          895ns ± 0%    888ns ± 0%    -0.75%  (p=0.000 n=10+10)
IntentScan/versions=100/percent-flushed=0-30          33.1µs ± 1%   31.8µs ± 2%    -4.15%  (p=0.000 n=8+10)
IntentScan/versions=100/percent-flushed=50-30         8.56µs ± 4%   8.37µs ± 3%      ~     (p=0.050 n=9+9)
IntentScan/versions=100/percent-flushed=80-30         4.29µs ± 2%   4.28µs ± 2%      ~     (p=0.796 n=10+10)
IntentScan/versions=100/percent-flushed=90-30         2.66µs ± 1%   2.65µs ± 2%      ~     (p=0.056 n=9+10)
IntentScan/versions=100/percent-flushed=100-30         975ns ± 1%    970ns ± 0%    -0.53%  (p=0.001 n=10+9)
IntentScan/versions=200/percent-flushed=0-30          67.5µs ± 1%   65.0µs ± 1%    -3.66%  (p=0.000 n=10+10)
IntentScan/versions=200/percent-flushed=50-30         20.3µs ± 1%   20.1µs ± 8%      ~     (p=0.400 n=9+10)
IntentScan/versions=200/percent-flushed=80-30         7.13µs ± 6%   6.87µs ± 2%    -3.65%  (p=0.003 n=10+10)
IntentScan/versions=200/percent-flushed=90-30         4.26µs ± 3%   4.24µs ± 2%      ~     (p=0.393 n=10+10)
IntentScan/versions=200/percent-flushed=100-30        1.04µs ± 1%   1.04µs ± 1%      ~     (p=0.135 n=10+10)
IntentScan/versions=400/percent-flushed=0-30          33.6µs ± 3%   33.0µs ± 2%    -1.89%  (p=0.001 n=10+9)
IntentScan/versions=400/percent-flushed=50-30         43.9µs ± 1%   43.2µs ± 1%    -1.66%  (p=0.002 n=10+10)
IntentScan/versions=400/percent-flushed=80-30         15.3µs ± 4%   15.4µs ± 7%      ~     (p=0.684 n=10+10)
IntentScan/versions=400/percent-flushed=90-30         7.14µs ± 2%   7.09µs ± 4%      ~     (p=0.161 n=9+9)
IntentScan/versions=400/percent-flushed=100-30        1.77µs ± 2%   1.75µs ± 1%    -0.88%  (p=0.023 n=9+9)
IntentResolution/versions=10/percent-flushed=0-30     3.60µs ± 1%   3.80µs ± 1%    +5.56%  (p=0.000 n=10+10)
IntentResolution/versions=10/percent-flushed=50-30    3.16µs ± 2%   2.55µs ± 1%   -19.27%  (p=0.000 n=10+10)
IntentResolution/versions=10/percent-flushed=80-30    2.61µs ± 1%   2.29µs ± 2%   -12.41%  (p=0.000 n=10+10)
IntentResolution/versions=10/percent-flushed=90-30    2.55µs ± 1%   2.07µs ± 1%   -18.69%  (p=0.000 n=10+9)
IntentResolution/versions=10/percent-flushed=100-30   2.44µs ± 1%   1.98µs ± 1%   -19.04%  (p=0.000 n=10+10)
IntentResolution/versions=100/percent-flushed=0-30    10.5µs ± 5%   18.2µs ± 3%   +73.37%  (p=0.000 n=10+10)
IntentResolution/versions=100/percent-flushed=50-30   4.54µs ± 1%   6.30µs ± 2%   +38.76%  (p=0.000 n=10+10)
IntentResolution/versions=100/percent-flushed=80-30   4.28µs ± 2%   3.84µs ± 1%   -10.26%  (p=0.000 n=10+10)
IntentResolution/versions=100/percent-flushed=90-30   3.90µs ± 1%   3.12µs ± 1%   -19.87%  (p=0.000 n=10+10)
IntentResolution/versions=100/percent-flushed=100-30  2.61µs ± 1%   2.03µs ± 1%   -22.08%  (p=0.000 n=10+9)
IntentResolution/versions=200/percent-flushed=0-30    14.8µs ± 4%   35.9µs ± 2%  +143.54%  (p=0.000 n=9+10)
IntentResolution/versions=200/percent-flushed=50-30   4.86µs ± 2%  10.53µs ± 5%  +116.63%  (p=0.000 n=10+10)
IntentResolution/versions=200/percent-flushed=80-30   4.60µs ± 2%   5.59µs ± 1%   +21.50%  (p=0.000 n=10+9)
IntentResolution/versions=200/percent-flushed=90-30   4.35µs ± 2%   3.90µs ± 1%   -10.32%  (p=0.000 n=10+10)
IntentResolution/versions=200/percent-flushed=100-30  2.65µs ± 1%   2.07µs ± 1%   -21.89%  (p=0.000 n=10+10)
IntentResolution/versions=400/percent-flushed=0-30    5.10µs ± 2%  16.54µs ± 3%  +224.50%  (p=0.000 n=10+10)
IntentResolution/versions=400/percent-flushed=50-30   5.16µs ± 2%  22.19µs ± 2%  +329.77%  (p=0.000 n=10+10)
IntentResolution/versions=400/percent-flushed=80-30   4.92µs ± 0%   8.74µs ± 3%   +77.82%  (p=0.000 n=8+10)
IntentResolution/versions=400/percent-flushed=90-30   4.67µs ± 1%   5.60µs ± 1%   +19.80%  (p=0.000 n=10+10)
IntentResolution/versions=400/percent-flushed=100-30  2.83µs ± 1%   2.10µs ± 1%   -25.91%  (p=0.000 n=10+10)

Given that these cases impact the front half of mutations regardless of what we do during intent resolution, I'm ok with the limited regressions on this microbenchmark. Still, to minimize the impact, I re-introduced half of the previous optimization into MVCCResolveWriteIntent by initially seeking to the intent version instead of the version prefix. I also left a comment explaining the opportunity to optimize further if we track the strengths that locks on a given key are held with, with a reference back to d1c91e0. For now, I'll avoid optimizing further until we see this come up in real-world workloads.

TFTRs!

bors r+


craig bot commented Sep 23, 2023

Build succeeded:

@craig craig bot merged commit 43326b2 into cockroachdb:master Sep 23, 2023
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/releaseReplLocks branch September 23, 2023 19:45
@nvanbenschoten (Member, Author)

For completeness, here's the impact of this change on BenchmarkIntentRangeResolution:

name                                                                                               old time/op  new time/op  delta
IntentRangeResolution/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=0-30         256µs ± 1%   263µs ± 2%   +2.95%  (p=0.000 n=10+10)
IntentRangeResolution/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=50-30        117µs ± 1%   124µs ± 1%   +6.11%  (p=0.000 n=10+9)
IntentRangeResolution/versions=10/sparseness=1/other-txn-intents=false/percent-flushed=100-30      68.0µs ± 2%  75.4µs ± 2%  +10.89%  (p=0.000 n=10+10)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=0-30       210µs ± 1%   208µs ± 1%   -0.78%  (p=0.000 n=10+9)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=50-30     51.6µs ± 1%  51.5µs ± 1%     ~     (p=0.905 n=9+10)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=false/percent-flushed=100-30    2.19µs ± 4%  2.28µs ± 2%   +4.09%  (p=0.000 n=10+9)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=0-30        237µs ± 1%   203µs ± 0%  -14.13%  (p=0.000 n=9+10)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=50-30      81.9µs ± 1%  50.7µs ± 1%  -38.11%  (p=0.000 n=9+9)
IntentRangeResolution/versions=10/sparseness=100/other-txn-intents=true/percent-flushed=100-30     42.1µs ± 1%  12.4µs ± 1%  -70.53%  (p=0.000 n=9+10)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=0-30      209µs ± 1%   206µs ± 1%   -1.72%  (p=0.000 n=10+9)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=50-30    49.7µs ± 1%  49.8µs ± 1%     ~     (p=0.447 n=10+9)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=false/percent-flushed=100-30   1.34µs ± 3%  1.32µs ± 4%     ~     (p=0.051 n=9+10)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=0-30       236µs ± 1%   202µs ± 1%  -14.17%  (p=0.000 n=10+10)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=50-30     81.8µs ± 1%  49.6µs ± 1%  -39.32%  (p=0.000 n=10+10)
IntentRangeResolution/versions=10/sparseness=1000/other-txn-intents=true/percent-flushed=100-30    41.8µs ± 1%  11.5µs ± 5%  -72.35%  (p=0.000 n=9+10)
IntentRangeResolution/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=0-30       2.77ms ± 1%  2.77ms ± 2%     ~     (p=0.604 n=9+10)
IntentRangeResolution/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=50-30       717µs ± 4%   735µs ± 4%   +2.57%  (p=0.035 n=10+10)
IntentRangeResolution/versions=100/sparseness=1/other-txn-intents=false/percent-flushed=100-30     67.3µs ± 1%  74.6µs ± 1%  +10.85%  (p=0.000 n=9+8)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=0-30     2.74ms ± 1%  2.75ms ± 2%     ~     (p=0.481 n=10+10)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=50-30     651µs ± 6%   632µs ± 3%   -2.99%  (p=0.028 n=10+9)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=false/percent-flushed=100-30   2.13µs ± 0%  2.28µs ± 2%   +7.08%  (p=0.000 n=8+10)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=0-30      2.74ms ± 1%  2.73ms ± 2%     ~     (p=0.247 n=10+10)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=50-30      674µs ± 2%   622µs ± 2%   -7.64%  (p=0.000 n=10+9)
IntentRangeResolution/versions=100/sparseness=100/other-txn-intents=true/percent-flushed=100-30    39.5µs ± 0%  12.4µs ± 1%  -68.70%  (p=0.000 n=8+9)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=0-30    2.73ms ± 2%  2.72ms ± 2%     ~     (p=0.739 n=10+10)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=50-30    639µs ± 6%   628µs ± 6%     ~     (p=0.280 n=10+10)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=false/percent-flushed=100-30  1.35µs ± 3%  1.32µs ± 2%   -2.82%  (p=0.000 n=10+9)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=true/percent-flushed=0-30     2.76ms ± 2%  2.70ms ± 1%   -2.15%  (p=0.000 n=10+9)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=true/percent-flushed=50-30     670µs ± 3%   620µs ± 4%   -7.44%  (p=0.000 n=9+10)
IntentRangeResolution/versions=100/sparseness=1000/other-txn-intents=true/percent-flushed=100-30   39.1µs ± 1%  11.5µs ± 1%  -70.63%  (p=0.000 n=8+10)
IntentRangeResolution/versions=400/sparseness=1/other-txn-intents=false/percent-flushed=0-30       3.06ms ± 1%  3.07ms ± 2%     ~     (p=0.315 n=10+10)
IntentRangeResolution/versions=400/sparseness=1/other-txn-intents=false/percent-flushed=50-30      4.08ms ± 1%  4.08ms ± 1%     ~     (p=0.842 n=9+10)
IntentRangeResolution/versions=400/sparseness=1/other-txn-intents=false/percent-flushed=100-30     70.8µs ± 2%  78.3µs ± 2%  +10.63%  (p=0.000 n=10+10)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=false/percent-flushed=0-30     2.99ms ± 2%  2.99ms ± 3%     ~     (p=0.739 n=10+10)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=false/percent-flushed=50-30    3.99ms ± 2%  4.00ms ± 2%     ~     (p=0.436 n=10+10)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=false/percent-flushed=100-30   2.20µs ± 3%  2.32µs ± 2%   +5.55%  (p=0.000 n=9+10)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=true/percent-flushed=0-30      2.99ms ± 1%  2.95ms ± 2%   -1.57%  (p=0.000 n=10+10)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=true/percent-flushed=50-30     4.03ms ± 2%  3.96ms ± 2%   -1.81%  (p=0.004 n=10+9)
IntentRangeResolution/versions=400/sparseness=100/other-txn-intents=true/percent-flushed=100-30    43.2µs ± 2%  12.1µs ± 1%  -71.91%  (p=0.000 n=9+10)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=false/percent-flushed=0-30    2.98ms ± 2%  2.95ms ± 1%     ~     (p=0.052 n=10+10)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=false/percent-flushed=50-30   3.94ms ± 0%  3.93ms ± 1%     ~     (p=0.481 n=8+9)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=false/percent-flushed=100-30  1.37µs ± 2%  1.33µs ± 0%   -2.95%  (p=0.000 n=9+8)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=true/percent-flushed=0-30     3.00ms ± 1%  2.95ms ± 3%   -1.92%  (p=0.001 n=10+10)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=true/percent-flushed=50-30    3.99ms ± 1%  3.94ms ± 2%   -1.24%  (p=0.006 n=9+10)
IntentRangeResolution/versions=400/sparseness=1000/other-txn-intents=true/percent-flushed=100-30   42.8µs ± 3%  11.2µs ± 2%  -73.74%  (p=0.000 n=9+9)

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 29, 2023
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter
before seek" optimization in the LockTableIterator, similar to the one
that exists in the pebbleMVCCScanner. The optimization places an upper
bound on the number of iterations that a LockTableIterator that is
configured to ignore some or all shared locks will perform across the
shared locks on a single user key before seeking past them. This is used
to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks)
work in the LockTableIterator, instead performing at most
O(matching_locks + locked_keys) work. This is important for iteration
over the lock table (e.g. intentInterleavingIter), lock acquisition
(MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is
a caveat to these complexity bounds, however, in that they do not
consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
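The iter-before-seek pattern described in the commit message above can be sketched as a toy model (a hedged illustration: `findFirstGE`, `itersBeforeSeek`, and the slice-backed "iterator" are invented for this sketch and are not the actual LockTableIterator code; the real bound is tuned, 5 here is arbitrary):

```go
package main

import (
	"fmt"
	"sort"
)

// itersBeforeSeek caps how many cheap Next() steps we take over
// ignored entries before giving up and seeking past them. The real
// constant is tuned; 5 is illustrative.
const itersBeforeSeek = 5

// findFirstGE walks a sorted key slice looking for the first entry
// >= target, starting at index start. It mimics the iter-before-seek
// pattern: step forward a bounded number of times, then fall back to
// an O(log n) "seek" (binary search here) past the ignored run.
func findFirstGE(keys []string, start int, target string) (idx int, steps int, seeked bool) {
	i := start
	for n := 0; n < itersBeforeSeek; n++ {
		if i >= len(keys) || keys[i] >= target {
			return i, n, false
		}
		i++ // cheap Next()
	}
	// Too many ignored entries: seek past them instead of stepping.
	j := sort.SearchStrings(keys, target)
	return j, itersBeforeSeek, true
}

func main() {
	// Seven ignored "shared lock" keys before the one we want.
	keys := []string{"a1", "a2", "a3", "a4", "a5", "a6", "a7", "b1"}
	idx, steps, seeked := findFirstGE(keys, 0, "b1")
	fmt.Println(idx, steps, seeked) // prints: 7 5 true
}
```

After `itersBeforeSeek` steps the scan cuts over to a seek, which is what bounds the work at O(matching_locks + locked_keys) instead of O(ignored_locks); as the commit notes, LSM tombstones sit below this abstraction and are not captured by the toy model.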
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Oct 2, 2023
craig bot pushed a commit that referenced this pull request Oct 2, 2023
110754: storage: implement iter-before-seek optimization for LockTableIterator r=nvanbenschoten a=nvanbenschoten

Informs #100193.

This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.

Release note: None

111126: pkg/util/log: introduce new metrics to the logging package r=knz a=abarganier

See individual commit messages to review.

This PR introduces two new metrics to the logging package:
- `log.messages.count`
- `log.buffered.messages.dropped`

**log.messages.count**: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

**log.buffered.messages.dropped**: Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer is full, appending another log message would push it past `max-buffer-size`, so the buffered log sink drops older messages from the buffer to make room for the new ones.
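The drop-oldest behavior being counted can be sketched with a toy bounded buffer (a hedged model: `boundedBuffer` and its fields are invented for illustration, not the actual `pkg/util/log` implementation):

```go
package main

import "fmt"

// boundedBuffer is a toy model of a buffered log sink with a byte cap:
// when appending a message would exceed maxBytes, the oldest messages
// are dropped first. The dropped field models what a metric like
// log.buffered.messages.dropped would count.
type boundedBuffer struct {
	maxBytes int
	curBytes int
	msgs     []string
	dropped  int
}

func (b *boundedBuffer) append(msg string) {
	// Evict from the front until the new message fits. (Toy
	// simplification: a message larger than maxBytes is still
	// appended once the buffer is empty.)
	for len(b.msgs) > 0 && b.curBytes+len(msg) > b.maxBytes {
		b.curBytes -= len(b.msgs[0])
		b.msgs = b.msgs[1:]
		b.dropped++
	}
	b.msgs = append(b.msgs, msg)
	b.curBytes += len(msg)
}

func main() {
	b := &boundedBuffer{maxBytes: 10}
	for _, m := range []string{"aaaa", "bbbb", "cccc"} {
		b.append(m)
	}
	fmt.Println(b.msgs, b.dropped) // prints: [bbbb cccc] 1
}
```

The third 4-byte message would push the buffer to 12 bytes, past the 10-byte cap, so the oldest message is dropped and the drop counter increments, mirroring the shared counter described above.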

This PR also renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`, for consistency.

Fixes: #72453

----

Release note (ops change): This patch sets the Metric Type
on the metric `log.fluent.sink.conn.errors`. Previously, the
Metric Type was incorrectly left unset.

Note that this is simply an update to the metric's metadata.
The behavior and purpose of the metric remains unchanged.

----

Release note (ops change): This patch introduces the metric,
`log.messages.count`.

This metric measures the count of messages logged on the
node since startup. Note that this does not measure the
fan-out of single log messages to the various configured
logging sinks.

This metric can be helpful in understanding log rates and
volumes.

----

Release note (ops change): This patch introduces a new metric,
`log.buffered.messages.dropped`.

Buffered network logging sinks have a `max-buffer-size` attribute,
which determines, in bytes, how many log messages can be buffered.
Any `fluent-server` or `http-server` log sink that makes use of
a `buffering` attribute in its configuration (enabled by default)
qualifies as a buffered network logging sink.

If this buffer becomes full, an additional log message sent
to the buffered log sink would push the buffer past this
`max-buffer-size`. Therefore, the buffered log sink drops
older messages from the buffer to make room for the new ones.

`log.buffered.messages.dropped` counts the number of messages
dropped from the buffer. Note that the count is shared across all
buffered logging sinks.

111603: server: always START SERVICE SHARED in testserver.StartSharedProcessTenant() r=knz a=msbutler

Previously, StartSharedProcessTenant() would hang if it were run on a tenant that was created by a replication stream. This patch fixes this bug by ensuring `ALTER TENANT $1 START SERVICE SHARED` is run even if the tenant was already created.

Epic: none

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023