storage: implement intent resolution using LockTableIterator #110324
Conversation
Force-pushed from 2bb94d4 to a111678.
-- commits
line 146 at r6:
Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?
Previously, sumeerbhola wrote…
Is the problem that the LockUpdate does not tell us what locks were held by this txn on the span?
What if we tracked spans for each lock strength in the txn record?
I can't remember whether this optimization was prompted only by microbenchmarks. The case described in the second bullet of the commit message was not optimized even before separated intents. Perhaps I was being extra sensitive to intent resolution performance at the time, given that intent resolution latency is important for lowering contention.
Another possibility would be to do a series of SeekGEWithLimit calls, one for each of the strengths. We don't fully optimize this monotonic-seek case in pebble.Iterator, but it would not be hard to change that to use next calls under the hood. We would still need to produce a limit key, which is a bit problematic since it needs to be a valid EngineKey for cmp to work and needs to be exclusive. If we can efficiently construct the immediately smaller txn ID, then we can produce a limit key.
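Roughly, the shape would be something like the following sketch (illustrative only; the limit-key construction is left as a comment because building that "immediately smaller txn ID" key is exactly the open question above):

```go
package storagesketch // illustrative only

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/cockroachdb/cockroach/pkg/util/uuid"
)

// seekPerStrength sketches resolving a single key with one seek per lock
// strength, instead of scanning every lock-table version on the key.
func seekPerStrength(iter storage.EngineIterator, key roachpb.Key, txnID uuid.UUID) error {
	// Lock-table versions on a key sort as Intent, Exclusive, Shared.
	for _, str := range []lock.Strength{lock.Intent, lock.Exclusive, lock.Shared} {
		ltKey := storage.LockTableKey{Key: key, Strength: str, TxnUUID: txnID}
		seekKey, _ := ltKey.ToEngineKey(nil)
		// Ideally this would be a seek-with-limit whose exclusive limit sits
		// just below (key, str, txnID), so a missing lock doesn't pay for a
		// scan over unrelated tombstones; constructing that limit key is the
		// hard part discussed above.
		valid, err := iter.SeekEngineKeyGE(seekKey)
		if err != nil || !valid {
			return err
		}
		// ... if the iterator landed exactly on (key, str, txnID), resolve it.
	}
	return nil
}
```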
Force-pushed from 7efccf0 to 2267f58.
Right, we don't currently track which locks with which strengths are held on which keys, only that there are locks with some strength held on certain keys. Tracking the strength of locks in the coordinator and then storing locks with their strength in the txn record would let us perform more precise resolution.
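If we did track this, the coordinator-side bookkeeping might look something like the sketch below (purely hypothetical; no such structure exists today):

```go
package lockspans // hypothetical

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spansByStrength is a hypothetical per-strength record of the spans a txn
// acquired locks on, which could be carried in the txn record / LockUpdate.
type spansByStrength map[lock.Strength][]roachpb.Span

// strengthsHeldOn returns the strengths that may be held within span, letting
// resolution seek only to those (key, strength, txn_id) versions.
func strengthsHeldOn(m spansByStrength, span roachpb.Span) []lock.Strength {
	var out []lock.Strength
	for str, held := range m {
		for _, s := range held {
			if s.Overlaps(span) {
				out = append(out, str)
				break
			}
		}
	}
	return out
}
```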
From what I was able to dig up, it sounds like one of the SQL logic tests that performed ranged intent resolution became significantly slower with separated intents. I don't see any mention of single-key resolution being so slow that we had to optimize it. It makes me wonder whether the first bullet in the commit message was an opportunistic improvement. I'm running the benchmarks introduced in that commit over this PR to get a better sense for the impact.
I've been considering this. It would improve the worst-case performance of single-key resolution in exchange for a more expensive common case (3 seeks instead of 1), and I wonder whether that's worth it. We already scan over all locks on a key (including tombstones) when writing these intents, so the degenerate case where a single key is written and resolved in quick succession is quadratic no matter what we do here.
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
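For context, the heuristic itself is the familiar step-a-few-times-then-seek pattern. A simplified, self-contained sketch (not the actual LockTableIterator code; the constant, interface, and function names here are invented):

```go
package ltscan // illustrative only

import "bytes"

// itersBeforeSeek caps how many ignored locks we step over on a single user
// key before giving up and seeking straight past the key.
const itersBeforeSeek = 5

// lockIter is a stand-in for the real lock-table iterator.
type lockIter interface {
	Valid() bool
	Next()
	SeekGE(key []byte)
	UserKey() []byte // user key portion of the current lock-table key
	Ignored() bool   // true if the current lock is filtered out by config
}

// advancePastIgnored steps over ignored locks on curKey, but only up to
// itersBeforeSeek times; after that it seeks past the key entirely, so a key
// with many ignored shared locks costs one seek instead of O(locks) steps.
func advancePastIgnored(it lockIter, curKey []byte) {
	for i := 0; i < itersBeforeSeek; i++ {
		it.Next()
		if !it.Valid() || !bytes.Equal(it.UserKey(), curKey) || !it.Ignored() {
			return // new key, exhaustion, or a lock we actually want
		}
	}
	// Too many ignored locks on this key: seek to the first key after it.
	it.SeekGE(append(append([]byte(nil), curKey...), 0))
}
```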
I'm assuming we want to bottom out on the tombstones discussion before we merge -- I don't have too much to add, but I'll follow along.
found and the iterator advances past them while seeking. In such cases,
we're back to paying the cost of scanning over the tombstones.
I don't fully follow this bit -- could you ELI5 for me?
pkg/storage/mvcc.go
line 4574 at r8 (raw file):
// mvccResolveWriteIntent is the core logic for resolving an intent. The
// function accepts the instructions for how to resolve the intent (encoded in
nit: s/the instructions/instructions?
pkg/storage/mvcc.go
line 4578 at r8 (raw file):
// Returns whether the provided intent was resolved (true) or whether the
// resolution was a no-op (false).
// REQUIRES: iter surfaces range keys via IterKeyTypePointsAndRanges.
nit: while we're here, could you add a new line above REQUIRES please?
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
// We could also compute a tighter nextKey here if we wanted to.
// TODO(nvanbenschoten): this resumeSpan won't be correct if there
// are multiple locks on the same key. What if only of the locks for
nit: "only one"
Separately, could we include the strength when constructing lastResolvedKey
as well?
pkg/storage/mvcc.go
line 5134 at r8 (raw file):
return numKeys, numBytes, &roachpb.Span{Key: lastResolvedKey.Next(), EndKey: intentEndKey}, resumeReason, nil
}
ltEngineKey, err := ltIter.EngineKey()
[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?
pkg/storage/mvcc.go
line 5156 at r8 (raw file):
beforeBytes := rw.BufferedSize()
var ok bool
if ltKey.Strength == lock.Intent {
Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.
Previously, arulajmani (Arul Ajmani) wrote…
found and the iterator advances past them while seeking. In such cases, we're back to paying the cost of scanning over the tombstones.
I don't fully follow this bit -- could you ELI5 for me?
If the exact key we are looking for is not there, the internal Pebble DELs that have not been compacted away have to be iterated over. See the first bullet in the commit message for d1c91e0 -- that optimization works when the key is actually there.
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "only one"
Separately, could we include the strength when constructing
lastResolvedKey
as well?
The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.
pkg/storage/mvcc.go
line 5134 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
[for me] Why can't we use LockTableKeyVersion here, like we do in MVCCResolveWriteIntent? Does it have something to do with iterator creation (prefix vs. not)?
We also need ltKey.Key below, which we don't know up front since we are resolving a whole range.
Force-pushed from 2267f58 to a1ba790.
Pebble exposes a SeekGEWithLimit method, but this "limit" value is expressed as a key and not as a number of steps.
The problem that -WithLimit solves can't be expressed as a number of steps, right? My understanding is that it specifically helps in the case where the sought key doesn't exist, so the seek may need to next through an unbounded number of tombstones to arrive at the next live key.
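For reference, here is roughly how that contract can be used (sketch only; semantics paraphrased as I understand them, not authoritative):

```go
package pebblesketch // illustrative only

import "github.com/cockroachdb/pebble"

// seekBounded sketches how a limit key bounds the work of a seek: if nothing
// live exists in [seekKey, limitKey), pebble can stop at the limit rather
// than nexting through tombstones all the way to the next live key.
func seekBounded(it *pebble.Iterator, seekKey, limitKey []byte) bool {
	switch it.SeekGEWithLimit(seekKey, limitKey) {
	case pebble.IterValid:
		return true // positioned on a live key before limitKey
	case pebble.IterAtLimit:
		return false // nothing live before limitKey; stopped at the boundary
	default: // pebble.IterExhausted
		return false // no keys at or after seekKey
	}
}
```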
The related problem here is that NextWithLimit doesn't have a way to bound the amount of work performed before bailing. That would be useful for the kind of step-a-few-times-then-seek heuristics we have in a few places (including #110754), which balance the best-case cost of Next with the worst-case cost of Seek. If these heuristics are built above the pebble iterator, they don't have a way to limit the number of tombstones they encounter.
Have we considered pushing such a heuristic into Pebble? Something like an iter.SeekAfterNextWithLimit(key, limit).
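Strawman for what that could look like (entirely hypothetical; no such method exists in pebble today):

```go
package pebblestrawman // hypothetical only

import "github.com/cockroachdb/pebble"

// seekAfterNexter is a strawman for the API floated above; pebble has no such
// method today. The idea: try to reach key by stepping internally, but give
// up and do a real seek once more than limit internal keys (live or
// tombstoned) have been stepped over, so callers above the iterator can bound
// the tombstone cost they otherwise cannot see.
type seekAfterNexter interface {
	SeekAfterNextWithLimit(key []byte, limit int) pebble.IterValidityState
}
```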
pkg/storage/mvcc.go
line 4574 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/the instructions/instructions?
Done.
pkg/storage/mvcc.go
line 4578 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: while we're here, could you add a new line above REQUIRES please?
Done.
pkg/storage/mvcc.go
line 5129 at r8 (raw file):
Previously, sumeerbhola wrote…
The resume span and LockUpdate are already set up to use roachpb.Key, so that would be a bigger change. And there is a maximum of 3 locks that the txn could have held on a roachpb.Key, so fully processing a key is not going to exceed the limit significantly.
Yeah, I think it's cleanest to just keep this key-oriented. But it's a TODO I need to address in #110480 before merging that change.
pkg/storage/mvcc.go
line 5156 at r8 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Is it worth pulling out a resolveLock function? Feel free to defer this until we implement lock resolution for other lock strengths.
Yep, see #110480 😃
Force-pushed from a1ba790 to c01eeee.
Have we considered pushing such a heuristic into Pebble? Something like an iter.SeekAfterNextWithLimit(key, limit).
SeekGEWithLimit will transparently try to next internally for repeated seeks with monotonically increasing keys, but it has limitations. See:
https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/iterator.go#L1207-L1249 https://github.com/cockroachdb/pebble/blob/bb00504445221cbb49864fea1277905a5ee518dc/sstable/reader_iter_single_lvl.go#L669-L696
I think this is what motivated @sumeerbhola's three-seek suggestion, but I'm not sure if any of the TrySeekUsingNext limitations apply here.
Informs cockroachdb#109648.

This commit implements intent resolution (point and ranged) using a `LockTableIterator`, configured to return all locks for the transaction being resolved and no locks from other transactions. This is the first step towards releasing replicated locks during intent resolution.

While switching to a LockTableIterator, the commit is also able to remove separatedIntentAndVersionIter, iterForKeyVersions, and mvccGetIntent, which were all used to avoid handing an MVCCMetadata directly to mvccResolveWriteIntent. Instead of continuing to treat intents as interleaved, we switch to handling intents entirely separately from their provisional value during intent resolution, which avoids jumping through these hoops and makes the code simpler.

The change to `TestMVCCResolveTxnRangeResumeWithManyVersions` is immaterial and has to do with the transaction ID filter being applied before the key limit (inside LockTableIterator), instead of after. The new behavior is actually better.

----

One concern I have about this change is that it removes the call to `SeekIntentGE` in `MVCCResolveWriteIntent`, which was added in d1c91e0 to guard against the case where many pebble tombstones from prior intents from different txns on a key surround the intent being resolved. Conceptually, we'd like to push optimizations that avoid scanning over these tombstones into the `LockTableIterator`, like we plan to do for skipping over non-conflicting locks. Doing so would benefit all lock strengths. It would also benefit the case where an intent is not found and the seek hits tombstones from prior intents on later versions.

However, it's not clear how to do this with the current Pebble API. Pebble exposes a `SeekGEWithLimit` method, but this "limit" value is expressed as a key and not as a number of steps. How would we construct a limit key to bound the number of tombstones a seek observes before seeking directly to a specific (txn_id, lock_strength) version?

One option would be to seek to specific versions in the `LockTableIterator` when advancing the iterator in cases where the iterator is configured to match a specific txn ID. For example, performing the following translations:

```
SeekGE({Key: k}) -> SeekGE({Key: k, Strength: Intent, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Exclusive, TxnID: <txn_id>})
Next()           -> SeekGE({Key: k, Strength: Shared, TxnID: <txn_id>})
```

Of course, this gets more complicated when some of these locks are not found and the iterator advances past them while seeking. In such cases, we're back to paying the cost of scanning over the tombstones. If we knew which lock strengths we had acquired on a key, we could avoid some of this cost, but that would require API changes and client buy-in to track lock spans on a per-strength basis.

I'll capture the impact of this change on the following benchmarks and evaluate:

* BenchmarkIntentResolution
* BenchmarkIntentRangeResolution
* BenchmarkIntentScan

Release note: None
Now that the specialized method is no longer used in `MVCCResolveWriteIntent`, we can delete it.

Epic: None
Release note: None
Force-pushed from c01eeee to 4e46443.
Here is the impact that this change has on the benchmarks added in d1c91e0. We see that it's non-impactful except for the most degenerate cases, as expected:
Given that these cases impact the front half of mutations regardless of what we do during intent resolution, I'm ok with the limited regressions on this microbenchmark. Still, to minimize the impact, I re-introduced half of the previous optimization into
TFTRs!
bors r+
Build succeeded:
For completeness, here's the impact of this change on