kv, storage: reduce frequency of RETRY_ASYNC_WRITE_FAILURE failures #28876
And also I wanted to say - if indeed the reason why we get more of these errors now is that for pipelined writes there's no retry loop in the […]
Yeah, this is worth an investigation to at least determine what the cause of these retryable errors is, thanks for filing. For the record, the retry loop is still there for reproposals, just not for re-evaluations. I think we could try to revive some re-evaluation if we compared the result against the previous result, but that seems messy. A better approach is to make writes idempotent and re-issue them (possibly synchronously) if they are ever found to be missing.
It might be true that we can do this even without idempotency if we can prove that a missing write implies that that write will never succeed. In that case, we could transparently re-issue the write and continue without a client-side retry as long as the result is the same (not dissimilar to the refresh span optimization). The main downside is that we'd then need to hold on to the BatchRequest for all outstanding writes on the client.
Yeah, I meant the other way around here; re-evaluations are what we'd want here. So why exactly can't the retry loop (not the client) perform re-evaluations too?
Because once we leave the command queue, the re-evaluation might produce a different result.
Informs cockroachdb#28876. This change adds a new metric called `txn.restarts.asyncwritefailure`, along with a corresponding series to the admin ui's `KV Transaction Restarts` graph. These transaction restarts are caused by async consensus writes which fail while a transaction is attempting to pipeline writes. The restart occurs when the transaction visits the write's key while attempting to prove that the write succeeded and finds a missing intent. Release note: None
29499: kv/ui: add metric and graph for AsyncWriteFailure restarts r=nvanbenschoten a=nvanbenschoten Co-authored-by: Nathan VanBenschoten <[email protected]>
@timveil was hitting this reliably with a CREATE TABLE statement followed by immediate transactional writes to the table. I suspect that he was hitting a lease transfer on the new table's range, which was recently split off as a new range. We should try much harder to avoid these errors. The long term solution would be to reissue the failed writes and verify that their results were the same as before (because we're not protected by latches and an earlier write could have slipped in). Now that writes within a transaction are idempotent, this should be possible, although it will add a large amount of complexity and require additional bookkeeping in the transaction coordinator. In the meantime, we should try to avoid hitting these errors as often. The biggest offenders here are lease transfers. An async write will fail to retry on a […]
I don't think there's any deep reason; we just hadn't thought of this before. But I'm not sure I understand the situation. As soon as we transfer the lease away, don't we poison our lease so we won't use it any more? We are fairly quick to return NotLeaseHolderError because we expect that these errors will always be retried. Why aren't they retried here?
Yes, but we reject any in-flight proposals. The issue is that for asynchronous proposals, no one is listening for this rejection. Transactions only learn of these failures when they go back to query their intents.
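The "no one listening" situation can be illustrated with a small Go sketch (hypothetical types; the real proposal plumbing in CockroachDB is more involved):

```go
package main

import "fmt"

// proposalResult is the outcome of an async consensus write (illustrative type).
type proposalResult struct{ err error }

// propose models an apply-time rejection (e.g. due to a lease transfer). The
// client of a synchronous write waits on the done channel; a pipelined write
// detaches after evaluation, so nobody receives the rejection.
func propose(done chan proposalResult) {
	select {
	case done <- proposalResult{err: fmt.Errorf("rejected: lease transfer in progress")}:
		// A synchronous caller was waiting and can retry immediately.
	default:
		// No one is listening: the transaction only learns of the failure
		// later, when it queries its intent and finds it missing.
	}
}

func main() {
	// Pipelined write: the channel is unbuffered and nobody receives on it,
	// so the rejection is silently dropped.
	propose(make(chan proposalResult))
	fmt.Println("pipelined write detached; rejection was dropped")
}
```

The non-blocking send with `default` is what makes the failure invisible until the transaction later tries to prove the write succeeded.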
Ah, got it. So the effect of using a span latch here would be to delay the lease transfer until we flush out the existing writes, not to prevent new writes while the transfer is in progress. That would probably work, although I feel like the loss of automatic retries for these async writes may have other unfortunate consequences and we may want to try to figure out some way for them to be automatically retried by a new owner once they've detached from their original connection.
It would actually accomplish both. The latter is what I'm more worried about though, as any write proposed while a lease transfer is in progress is bound to fail.
We could be better about retrying without releasing latches if the leaseholder doesn't change (e.g. could we repropose on […])
The difference between raft leader and leaseholder strikes again. After the TransferLease command resolves and releases its latches, we immediately begin to transfer raft leadership. It's the raft leadership change that causes the apply-time error. We might be able to hang on to the latches during the raft transfer (or treat the raft transfer as a new pseudo-command that will get its own latches). Or as @nvanbenschoten's last post said, we could retry in this case as long as the leaseholder doesn't change (which should combine with the addition of full-range latches on lease transfers to be a complete solution).
We should be able to do that (I'm not quite clear why we don't - probably FUD about changing the MaxLeaseIndex of an existing proposal).
I think this may date back to some early speculation about propEvalKV, in which we planned to stack proposal batches on top of each other (instead, we rely on the span latches to ensure that proposals always commute). Or maybe it's something that was once necessary but other code changes have rendered moot. In any case, it looks to me now like we ought to be able to update the MaxLeaseIndex and handle everything within refreshProposalsLocked, which would be a nice simplification.
Aha, the reason behind the current design is that the primary source of […]
OK, paging back in the reasons for the […]. This is used for replay protection: we may propose the same command multiple times with the same MaxLeaseIndex. At most one of these can succeed (because a successful application sets LeaseAppliedIndex to its MaxLeaseIndex and all subsequent copies will fail with illegal lease indexes), and once we've seen one illegal lease index any remaining copies are guaranteed to fail. This allows us to retry safely on a lease index error as long as there hasn't been a prior success. The involvement of the caller (the executeWriteBatch loop) in the retry process is currently serving as an indirect test for whether we have seen a prior success (since that would cause the caller loop to exit). It looks like the presence of the proposal in […]
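The replay-protection invariant described here can be sketched in Go (a simplified model with illustrative names, not the actual CockroachDB code):

```go
package main

import "fmt"

// applyOutcome is the result of checking a proposal's lease index at apply time.
type applyOutcome int

const (
	apply                   applyOutcome = iota // lease index is legal; apply and advance
	rejectIllegalLeaseIndex                     // stale copy; safe to repropose if no copy succeeded
)

// checkLeaseIndex models the apply-time check: a proposal applies only if its
// MaxLeaseIndex is strictly greater than the range's LeaseAppliedIndex. A
// successful application then sets LeaseAppliedIndex to the proposal's
// MaxLeaseIndex, so at most one of several identical copies can ever succeed;
// once one copy hits an illegal lease index, all remaining copies must fail too.
func checkLeaseIndex(leaseAppliedIndex, maxLeaseIndex uint64) applyOutcome {
	if maxLeaseIndex <= leaseAppliedIndex {
		return rejectIllegalLeaseIndex
	}
	return apply
}

func main() {
	lai := uint64(10)
	// The first copy of a proposal with MaxLeaseIndex 11 applies...
	fmt.Println(checkLeaseIndex(lai, 11) == apply)
	lai = 11 // successful application advances LeaseAppliedIndex
	// ...and every later duplicate of the same proposal is rejected.
	fmt.Println(checkLeaseIndex(lai, 11) == rejectIllegalLeaseIndex)
}
```

This is why a lease index error is safe to retry as long as there has been no prior success: the rejection itself proves that this copy did not apply.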
I have a fix for this that I'm polishing up.
I've abandoned any span latch changes for lease operations. I think they're unnecessary (We already record that we've started a transfer and don't accept new writes until it finishes. Any writes that predate the transfer are mostly going to be fine; it's rare to get a retry error at that point). More importantly, they have significant performance risks for expiration-based leases - I wouldn't want to see the liveness range become unwriteable while a lease renewal is in progress. I'm going to skip this part unless and until we can demonstrate a more concrete need for it.
That's true, although lease operations don't wait for existing writes to finish evaluation or replication, so it's possible for them to jump ahead of writes. If we have re-ordering within Raft then we're already lost (mod discussion above) because the writes will hit a […]. Still, I think your intuition about not latching is correct given that we at least block future writes.
In a pipelined write, the client executeWriteBatch is detached early, so it may not be around to retry in the event of a lease index error. These errors need to be retried at a lower level (or else they will bubble up to the client). In practice, this means that writes that raced with a leadership transfer (especially the first insert into a newly-created table) were likely to generate ASYNC_WRITE_FAILURE retries without this change. Fixes cockroachdb#28876 Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE transaction retry errors, especially for the first insert into a newly-created table.
#35261 fixes the async write failures, but the test it adds is still flaky under stress (it fails a lot even without stress on teamcity, but locally it passes pretty reliably without stress). Now it gets […]
Ah, this is a TransactionAbortedError and not a TransactionRetryError. The logic here was touched by @nvanbenschoten's recent changes but the issue here is older than that (and mentioned in #33396 (review)). We're using the timestamp cache as the source of truth for certain transaction conflicts, but this means that when the tscache gets flushed on lease transfers we must assume the worst and treat these errors as aborts (which can be retried at the SQL layer or above, but cannot be retried by the span refresher). So this is another way that lease transfers are disruptive for clients that don't implement savepoint retries. Is there anything we can do about it? Using the abort span for operations on transaction records (and not just intents) might be one solution. I'm not sure what performance impact that would have or how much garbage it would leave lying around.
I take that back. The code around EndTransaction and the timestamp cache has been this way for a long time, but #33523 changed things so that PushTxn updates the write timestamp cache. I think without that we'd be safe to retry after a flushed timestamp cache from the TxnCoordSender. But with PushTxn using the timestamp cache to communicate with EndTxn, we have to treat this situation as a potential abort. |
We could marshal the write timestamp cache around on lease transfers to avoid this disruption, right?
Yes, that's one option. I don't know how big the write timestamp cache gets in practice. We could probably also filter out some parts of it (to only include entries where the txn id in the payload differs from the one in the key?)
Lazy question: which transactions exactly are subject to this disruption from the lease transfer? Is it only transactions that hadn't written their txn record yet at the moment the lease transfer happens, or is it all txns that were in-flight at the moment of the transfer? In other words, is the write timestamp cache consulted (by anything having to do with the txn record) even when there is a transaction record?
It's called only for transactions that hadn't written their txn record yet. But the idea with the new changes is that txn records are only written by heartbeats and commits, and that's part of the speedup. We could make the first heartbeat more aggressive or we could go back to writing the txn record with BeginTxn to reduce the window for this issue, but that would run counter to the work @nvanbenschoten has been doing here.
Well, I don't know about "counter". I believe the main point of the change was as a preparation for parallel commits, where you don't want anything to wait for the […]. If the wTsCache is too large, I think we can get away with just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark. And now I'm thinking - could we also keep the wTsCache small by removing entries from it once the client finds out that its txn has been aborted?
Yes, as @bdarnell mentioned, this is how things worked for the past few years - 471811e. The write timestamp cache was updated on […]
I'm curious to hear more about what you have in mind here. Are you saying that if we stopped using the write timestamp cache for PushTxn(ABORT), we could ignore possible replay errors as long as they made it back to the client? What do you mean by "flushed timestamp cache"? I need to think more about how that proposal works, especially with compound replays.
I like the idea of serializing portions of the timestamp cache and shipping them over lease exchanges. This really is an issue with a loss of information during lease transfers, so this seems like the natural place to solve it. I also wouldn't rule out that we might do the same thing for the read timestamp cache sometime in the future too, as the loss of resolution there can also cause retries that might bubble up to the client. "If the wTsCache is too large, I think we can get away by just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark" is along the lines of what I was thinking. We would attempt to maintain full resolution in the new leaseholder's timestamp cache (over the txnID portion) within some max time window (1s?). This serialization could be as simple as:
We could read this straight out of the old leaseholder's timestamp cache.
We could, although a structure like the timestamp cache is nice in that this GC is built in as it rolls over as time passes.
Kind of. The problem was that BeginTxn stalled the EndTxn when acquiring latches because they touched the same key and didn't commute. Also, FWIW I'm hoping to eliminate the existence of ABORTED transaction records entirely: see here. However, we might not get a lot from that if we still have persistent abort span entries to protect zombie transactions.
We wouldn't ignore the replay errors, we would treat them as retryable. My thinking is that if the write timestamp cache is only used for interactions between BeginTxn and EndTxn (and their possible replays), the TxnCoordSender knows enough about what operations it has sent to know when it can safely retry without the possibility of duplicating a successful transaction. (this is assuming that we can treat the POSSIBLE_REPLAY signal as unambiguous and any duplication lower in the stack would cause it to be transformed into an AmbiguousResultError)
I mean the zeroing (or rather low-water-marking) of the timestamp cache that occurs when leases change hands.
Yeah, it seems like a good idea to make lease transfers less costly in general.
Pipelined writes wait for commands to be evaluated, then detach before they apply. Retryable errors generated at evaluation time are handled correctly (by DistSender or above) even for pipelined writes, but pipelined writes have lost the ability to handle apply-time retry conditions in the executeWriteBatch loop (there used to be more of these, but illegal lease index is now the only condition retried at this level). To remedy this, this commit moves reproposals due to illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft leadership transfer. This is observable immediately after table creation, since table creation performs a split and then may perform a lease transfer to balance load (and if the lease is transferred, raft leadership is transferred afterward). Specifically:

1. The lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2 forwards the proposal to s1.
3. s1 initiates a raft leadership transfer to s2. This puts it into a temporarily leaderless state, so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and goes right into the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out-of-order lease index.

If w1 was pipelined, the client stops listening after step 2, and won't learn of the failure until it tries to commit. At this point the commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply time. These errors cannot be retried (since the proposer has lost its lease), so they will unavoidably result in an ASYNC_WRITE_FAILURE error to the client. However, this is uncommon - most NotLeaseHolderErrors are generated at evaluation time, which is compatible with pipelined writes.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE transaction retry errors, especially for the first insert into a newly-created table.
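The apply-time decision described in this commit message can be sketched as follows (a hypothetical simplification with illustrative names, not the actual CockroachDB code):

```go
package main

import "fmt"

// applyDecision models what happens when a command comes out of the raft log
// on the proposing node (illustrative type).
type applyDecision string

const (
	applied        applyDecision = "applied"
	reproposed     applyDecision = "reproposed"    // retried below raft, invisible to the client
	asyncWriteFail applyDecision = "async-failure" // later surfaces as an ASYNC_WRITE_FAILURE retry
)

// onApply: a command with a stale lease index is reproposed below raft on the
// proposing node as long as it still holds the lease; otherwise the failure
// eventually becomes visible to the (possibly detached) pipelined client.
func onApply(leaseAppliedIndex, maxLeaseIndex uint64, isProposer, stillLeaseholder bool) applyDecision {
	if maxLeaseIndex > leaseAppliedIndex {
		return applied
	}
	if isProposer && stillLeaseholder {
		return reproposed // assign a fresh MaxLeaseIndex and resubmit to raft
	}
	return asyncWriteFail
}

func main() {
	// w1 from the scenario: applied after w2, so its lease index i1 is stale.
	// Because s2 is the proposer and still holds the lease, it reproposes.
	fmt.Println(onApply(2, 1, true, true))
	// If the proposer had lost its lease, the failure could not be retried.
	fmt.Println(onApply(2, 1, true, false))
}
```

The key point is that the repropose path no longer depends on a client waiting in the executeWriteBatch loop, which is exactly what a pipelined write lacks.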
35261: storage: Improve handling of lease index retries r=tbg,nvanbenschoten a=bdarnell Fixes #28876. The fourth commit is the important part of this change. The fifth is optional; it's a simplification but it changes an under-tested error path. Assuming we like the fifth commit, I'd like to add a sixth that rips out proposalReevaluationReason and the executeWriteBatch/tryExecuteWriteBatch loop altogether. Co-authored-by: Ben Darnell <[email protected]>
#35261 fixed the major identified cause of async_write_failure retry errors, so I'm considering this complete. These errors should now only happen when there is a transaction conflict that wipes out one of our pending intents.
@nvanbenschoten, the `RETRY_ASYNC_WRITE_FAILURE` error I was telling you about is very easy to repro on master, so I'm thinking maybe it's worth verifying that it indeed happens because of a lease transfer. Maybe there's something else going on too (based on nothing but the frequency). I'm running something like `bin/roachtest run interleavedpartitioned --count 100`, reusing a cluster (so `--cluster <name>`), and more often than not the test fails within 20s or so. Looks like if it doesn't fail in the beginning, it won't fail later on either. I've also changed the duration below to `30s` to speed up the process: `cockroach/pkg/cmd/roachtest/interleavedpartitioned.go`, line 62 in 2e905a0. What do you think? Feel free to close or pass back if you're not interested. I've separately opened #28873 complaining that the test doesn't do retries.