kv, storage: reduce frequency of RETRY_ASYNC_WRITE_FAILURE failures #28876

Closed
andreimatei opened this issue Aug 21, 2018 · 29 comments

Labels: A-kv-client Relating to the KV client and the KV interface. C-investigation Further steps needed to qualify. C-label will change.

@andreimatei
Contributor

@nvanbenschoten, the RETRY_ASYNC_WRITE_FAILURE error I was telling you about is very easy to repro on master, so I'm thinking maybe it's worth verifying that it indeed happens because of a lease transfer. Maybe there's something else going on too (based on nothing but the frequency).
I'm running something like bin/roachtest run interleavedpartitioned --count 100 reusing a cluster (so --cluster <name>), and more often than not the test fails within 20s or so. It looks like if it doesn't fail in the beginning, it won't fail later on either.
I've also changed the duration below to 30s to speed up the process:

duration := " --duration " + ifLocal("10s", "10m")

What do you think? Feel free to close or pass back if you're not interested.

I've separately opened #28873 complaining that the test doesn't do retries.

@andreimatei andreimatei added C-investigation Further steps needed to qualify. C-label will change. a-core-storage labels Aug 21, 2018
@andreimatei
Contributor Author

And also I wanted to say: if the reason why we get more of these errors now is that, for pipelined writes, there's no retry loop in executeWriteBatch, then couldn't we add that retry loop back (for reproposals, not for re-evaluations) even if the client is not waiting for the response?

@nvanbenschoten
Member

Yeah, this is worth investigating, at least to determine the cause of these retryable errors; thanks for filing.

For the record, the retry loop is still there for reproposals, just not for re-evaluations. I think we could try to revive some form of re-evaluation if we compared the result against the previous result, but that seems messy. A better approach is to make writes idempotent and re-issue them (possibly synchronously) if they are ever found to be missing.

@nvanbenschoten
Member

nvanbenschoten commented Aug 21, 2018

A better approach is to make writes idempotent and re-issue them (possibly synchronously) if they are ever found to be missing.

It might be true that we can do this even without idempotency if we can prove that a missing write implies that that write will never succeed. In that case, we could transparently retry from the client and continue without a client-visible retry error as long as the result is the same (not dissimilar to the refresh span optimization). The main downside is that we'd then need to hold on to the BatchRequest for all outstanding writes on the client.
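
To make that bookkeeping concrete, here is a rough sketch under the assumptions above. All of the types and method names (writeTracker, proveOrReissue, and so on) are invented for illustration and do not correspond to the real client-side API; the idea is simply that the coordinator retains the BatchRequest for every outstanding pipelined write and, if the intent is later found missing, re-issues it and only continues transparently when the result matches the original.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for BatchRequest/BatchResponse; not the real protos.
type batchRequest struct {
	key   string
	value []byte
}

type batchResponse struct{ value []byte }

type sender interface {
	send(ctx context.Context, ba batchRequest) (batchResponse, error)
}

// writeTracker holds on to the request (and original result) for every
// outstanding pipelined write, which is the downside called out above.
type writeTracker struct {
	s        sender
	inFlight map[string]batchRequest
}

// proveOrReissue is invoked when the transaction revisits a pipelined write's
// key to prove it succeeded. If the intent is missing, the retained request is
// re-issued; the retry is transparent only if the result is unchanged.
func (t *writeTracker) proveOrReissue(
	ctx context.Context, key string, intentFound bool, origResult []byte,
) error {
	if intentFound {
		delete(t.inFlight, key)
		return nil
	}
	ba, ok := t.inFlight[key]
	if !ok {
		return errors.New("request not retained: surface RETRY_ASYNC_WRITE_FAILURE")
	}
	br, err := t.s.send(ctx, ba)
	if err != nil {
		return err
	}
	if !bytes.Equal(br.value, origResult) {
		// Something slipped in underneath the missing write; a transparent
		// retry is no longer safe.
		return errors.New("result changed: surface a transaction retry")
	}
	delete(t.inFlight, key)
	return nil
}

func main() { fmt.Println("illustrative sketch only") }
```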

@andreimatei
Contributor Author

For the record, the retry loop is still there for reproposals, just not for re-evaluations

Yeah, I meant it the other way around; re-evaluation is what we'd want here. So why exactly can't the retry loop (not the client) perform re-evaluations too?

@tbg tbg added A-kv-client Relating to the KV client and the KV interface. and removed a-core-storage labels Aug 21, 2018
@nvanbenschoten
Member

So why exactly can't the retry loop (not the client) perform re-evaluations too?

Because once we leave the command queue, the re-evaluation might produce a different result.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 4, 2018
Informs cockroachdb#28876.

This change adds a new metric called `txn.restarts.asyncwritefailure`,
along with a corresponding series to the admin ui's `KV Transaction
Restarts` graph. These transaction restarts are caused by async
consensus writes which fail while a transaction is attempting to
pipeline writes. The restart occurs when the transaction visits the
write's key while attempting to prove that the write succeeded and
finds a missing intent.

Release note: None
craig bot pushed a commit that referenced this issue Sep 4, 2018
29499: kv/ui: add metric and graph for AsyncWriteFailure restarts r=nvanbenschoten a=nvanbenschoten

Informs #28876.

This change adds a new metric called `txn.restarts.asyncwritefailure`,
along with a corresponding series to the admin ui's `KV Transaction
Restarts` graph. These transaction restarts are caused by async
consensus writes which fail while a transaction is attempting to
pipeline writes. The restart occurs when the transaction visits the
write's key while attempting to prove that the write succeeded and
finds a missing intent.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 11, 2018
Informs cockroachdb#28876.

This change adds a new metric called `txn.restarts.asyncwritefailure`,
along with a corresponding series to the admin ui's `KV Transaction
Restarts` graph. These transaction restarts are caused by async
consensus writes which fail while a transaction is attempting to
pipeline writes. The restart occurs when the transaction visits the
write's key while attempting to prove that the write succeeded and
finds a missing intent.

Release note: None
@tbg tbg added A-coreperf and removed A-kv-client Relating to the KV client and the KV interface. labels Sep 27, 2018
@tbg tbg added this to the 2.2 milestone Sep 27, 2018
@nvanbenschoten nvanbenschoten added A-kv-client Relating to the KV client and the KV interface. and removed A-coreperf labels Oct 12, 2018
@nvanbenschoten nvanbenschoten changed the title kv, storage: RETRY_ASYNC_WRITE_FAILURE easy to produce in roachtest kv, storage: reduce frequency of RETRY_ASYNC_WRITE_FAILURE failures Jan 28, 2019
@nvanbenschoten
Member

@timveil was hitting this reliably with a CREATE TABLE statement followed by immediate transactional writes to the table. I suspect that he was hitting a lease transfer on the new table's range, which had recently been split off.

We should try much harder to avoid these errors. The long term solution would be to reissue the failed writes and verify that their results were the same as before (because we're not protected by latches and an earlier write could have slipped in). Now that writes within a transaction are idempotent, this should be possible, although it will add a large amount of complexity and require additional bookkeeping in the transaction coordinator.

In the meantime, we should try to avoid hitting these errors as often. The biggest offenders here are lease transfers. An async write is not retried on a NotLeaseHolderError observed by its Raft proposal, meaning that any async write that hits one of these errors results in a RETRY_ASYNC_WRITE_FAILURE failure. We should do our best to remedy this by making these errors rarer. We could accomplish this by acquiring non-MVCC read latches on the entire range for lease transfers, which would prevent writes from running concurrently with lease transfers. @bdarnell do you have any insight into why we don't already do something like this?
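
For concreteness, here is a minimal, hypothetical sketch of that idea. The types and function names below (latchSpans, declareKeysTransferLease, and so on) are invented for illustration and are not the real spanset/batcheval API; the point is just that a non-MVCC read latch over the whole range would make a lease transfer wait for in-flight writes and block new ones until it completes.

```go
package main

import "fmt"

// Hypothetical stand-ins for the real latch/span machinery.
type access int

const (
	readAccess access = iota
	writeAccess
)

type span struct{ start, end string }

type declaredSpan struct {
	acc access
	sp  span
}

type latchSpans struct{ spans []declaredSpan }

// addNonMVCC declares a latch that is not scoped to an MVCC timestamp, so it
// conflicts with writes at any timestamp.
func (l *latchSpans) addNonMVCC(acc access, sp span) {
	l.spans = append(l.spans, declaredSpan{acc: acc, sp: sp})
}

// declareKeysTransferLease sketches the suggestion above: a read latch over
// [startKey, endKey) conflicts with every write latch on the range, so the
// lease transfer cannot start until outstanding writes release their latches,
// and new writes queue up behind it.
func declareKeysTransferLease(startKey, endKey string, latches *latchSpans) {
	latches.addNonMVCC(readAccess, span{start: startKey, end: endKey})
}

func main() {
	var l latchSpans
	declareKeysTransferLease("/Table/53", "/Table/54", &l)
	fmt.Printf("declared %d latch span(s) for the lease transfer\n", len(l.spans))
}
```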

@bdarnell
Contributor

do you have any insight into why we don't already do something like this?

I don't think there's any deep reason; we just hadn't thought of this before. But I'm not sure I understand the situation. As soon as we transfer the lease away, don't we poison our lease so we won't use it any more? We are fairly quick to return NotLeaseHolderErrors because we expect that they will always be retried. Why aren't they retried here?

@nvanbenschoten
Member

As soon as we transfer the lease away, don't we poison our lease so we won't use it any more? We are fairly quick to return NotLeaseHolderErrors because we expect that they will always be retried. Why aren't they retried here?

Yes, but we reject any in-flight proposals. The issue is that, for proposals that are asynchronous, no one is listening for this rejection. Transactions only learn of these failures when they go back to query their intent.

@bdarnell
Contributor

Ah, got it. So the effect of using a span latch here would be to delay the lease transfer until we flush out the existing writes, not to prevent new writes while the transfer is in progress. That would probably work, although I feel like the loss of automatic retries for these async writes may have other unfortunate consequences and we may want to try to figure out some way for them to be automatically retried by a new owner once they've detached from their original connection.

@nvanbenschoten
Member

So the effect of using a span latch here would be to delay the lease transfer until we flush out the existing writes, not to prevent new writes while the transfer is in progress.

It would actually accomplish both. The latter is what I'm more worried about though, as any write proposed while a lease transfer is in progress is bound to fail.

That would probably work, although I feel like the loss of automatic retries for these async writes may have other unfortunate consequences and we may want to try to figure out some way for them to be automatically retried by a new owner once they've detached from their original connection.

We could be better about retrying without releasing latches if the leaseholder doesn't change (e.g. could we repropose on proposalIllegalLeaseIndex without re-evaluating?), but if it does then we've lost the "protection" of latching and we'll have to re-evaluate the batch. In that case, the evaluation could result in a different value, which isn't ok. That would need to result in a txn retry.

@bdarnell
Contributor

The difference between raft leader and leaseholder strikes again. After the TransferLease command resolves and releases its latches, we immediately begin to transfer raft leadership. It's the raft leadership change that causes the apply-time error.

We might be able to hang on to the latches during the raft transfer (or treat the raft transfer as a new pseudo-command that will get its own latches). Or as @nvanbenschoten's last post said, we could retry in this case as long as the leaseholder doesn't change (which should combine with the addition of full-range latches on lease transfers to be a complete solution).

@tbg
Member

tbg commented Feb 25, 2019

(e.g. could we repropose on proposalIllegalLeaseIndex without re-evaluating?)

We should be able to do that (I'm not quite clear why we don't - probably FUD about changing the MaxLeaseIndex of an existing proposal).

@bdarnell
Contributor

I think this may date back to some early speculation about propEvalKV, in which we planned to stack proposal batches on top of each other (instead, we rely on the span latches to ensure that proposals always commute). Or maybe it's something that was once necessary but other code changes have rendered moot. In any case, it looks to me now like we ought to be able to update the MaxLeaseIndex and handle everything within refreshProposalsLocked, which would be a nice simplification.

@bdarnell
Contributor

Aha, the reason behind the current design is that the primary source of proposalIllegalLeaseIndex is not in refreshProposalsLocked; it's in processRaftCommand. It's straightforward to transform these into immediate reproposals in refreshProposalsLocked, but doing the same in processRaftCommand seems to deadlock. Time to fight with the storage mutex hydra again...

@bdarnell
Contributor

OK, paging back in the reasons for the MaxLeaseIndex mechanism:

This is used for replay protection: we may propose the same command multiple times with the same MaxLeaseIndex. At most one of these can succeed (because a successful application sets LeaseAppliedIndex to its MaxLeaseIndex and all subsequent copies will fail with illegal lease indexes), and once we've seen one illegal lease index any remaining copies are guaranteed to fail. This allows us to retry safely on a lease index error as long as there hasn't been a prior success. The involvement of the caller (the executeWriteBatch loop) in the retry process currently serves as an indirect test for whether we have seen a prior success (since a success would cause the caller loop to exit). It looks like the presence of the proposal in r.mu.proposals can serve the same purpose, and we can repropose with a higher MaxLeaseIndex as long as the proposal is present, the lease hasn't changed, and we haven't applied a snapshot (I think our handling of this rare case may be incorrect, because we only consider this possibility in refreshProposalsLocked and not in processRaftCommand).
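
A toy sketch of that rule, with made-up types rather than the real Replica internals: reproposing with a higher MaxLeaseIndex is only safe while the proposal is still tracked (so no copy can have applied), the lease is unchanged, and no snapshot has been applied since the proposal was created.

```go
package main

import "fmt"

// Hypothetical, simplified proposal state; these are not the real storage
// package types.
type proposal struct {
	id            string
	maxLeaseIndex uint64
	leaseSeq      uint64 // lease sequence captured when the proposal was created
}

type replica struct {
	proposals         map[string]*proposal // outstanding proposals, keyed by ID
	leaseAppliedIndex uint64
	leaseSeq          uint64
	appliedSnapshot   bool
}

// tryReproposeWithNewLeaseIndex reports whether the command was reproposed
// after hitting an illegal lease index at apply time.
func (r *replica) tryReproposeWithNewLeaseIndex(id string) bool {
	p, ok := r.proposals[id]
	if !ok {
		// Some copy of this command already applied (or the caller abandoned
		// it); reproposing could apply the write twice.
		return false
	}
	if p.leaseSeq != r.leaseSeq || r.appliedSnapshot {
		// The lease changed or we skipped state via a snapshot; this case
		// needs re-evaluation (and possibly a client-visible retry) instead.
		return false
	}
	// All earlier copies are guaranteed to fail their lease index check, so a
	// single new copy with a fresh MaxLeaseIndex cannot double-apply.
	p.maxLeaseIndex = r.leaseAppliedIndex + 1
	return true // the caller re-submits the command to raft
}

func main() {
	r := &replica{
		proposals:         map[string]*proposal{"cmd1": {id: "cmd1", maxLeaseIndex: 7, leaseSeq: 3}},
		leaseAppliedIndex: 9,
		leaseSeq:          3,
	}
	fmt.Println(r.tryReproposeWithNewLeaseIndex("cmd1")) // true; MaxLeaseIndex bumped to 10
}
```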

@bdarnell
Contributor

I have a fix for this that I'm polishing up.

(which should combine with the addition of full-range latches on lease transfers to be a complete solution).

I've abandoned any span latch changes for lease operations. I think they're unnecessary (We already record that we've started a transfer and don't accept new writes until it finishes. Any writes that predate the transfer are mostly going to be fine; it's rare to get a retry error at that point). More importantly, they have significant performance risks for expiration-based leases - I wouldn't want to see the liveness range become unwritable while a lease renewal is in progress. I'm going to skip this part unless and until we can demonstrate a more concrete need for it.

@nvanbenschoten
Member

I've abandoned any span latch changes for lease operations. I think they're unnecessary (We already record that we've started a transfer and don't accept new writes until it finishes. Any writes that predate the transfer are mostly going to be fine; it's rare to get a retry error at that point).

That's true, although lease operations don't wait for existing writes to finish evaluation or replication, so it's possible for them to jump ahead of writes. If we have re-ordering within Raft then we're already lost (modulo the discussion above) because the writes will hit a proposalIllegalLeaseIndex. However, this situation also exposes us to standard NotLeaseholderErrors if we have re-ordering before Raft. I'm not sure if we ever see this problem in practice.

Still, I think your intuition about not latching is correct given that we at least block future writes.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 28, 2019
In a pipelined write, the client executeWriteBatch is detached early,
so it may not be around to retry in the event of a lease index error.
These errors need to be retried at a lower level (or else they will
bubble up to the client).

In practice, this means that writes that raced with a leadership
transfer (especially the first insert into a newly-created table) were
likely to generate ASYNC_WRITE_FAILURE retries without this change.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE
transaction retry errors, especially for the first insert into a
newly-created table.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 28, 2019
In a pipelined write, the client executeWriteBatch is detached early,
so it may not be around to retry in the event of a lease index error.
These errors need to be retried at a lower level (or else they will
bubble up to the client).

In practice, this means that writes that raced with a leadership
transfer (especially the first insert into a newly-created table) were
likely to generate ASYNC_WRITE_FAILURE retries without this change.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE
transaction retry errors, especially for the first insert into a
newly-created table.
@bdarnell
Contributor

bdarnell commented Mar 4, 2019

#35261 fixes the async write failures, but the test it adds is still flaky under stress (it fails a lot even without stress on teamcity, but locally it passes pretty reliably without stress). Now it gets ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY on the commit. This appears to be caused by the lease transfer happening a little later, in between the transaction's last write and the commit, so the EndTransaction gets a retry error. If the lease transfer happened earlier, the write would have its timestamp bumped by the tscache and the internal span refresh would take effect. Maybe we're missing something when this error comes from the commit?

@bdarnell
Contributor

bdarnell commented Mar 4, 2019

Ah, this is a TransactionAbortedError and not a TransactionRetryError. The logic here was touched by @nvanbenschoten's recent changes but the issue here is older than that (and mentioned in #33396 (review)). We're using the timestamp cache as the source of truth for certain transaction conflicts, but this means that when the tscache gets flushed on lease transfers we must assume the worst and treat these errors as aborts (which can be retried at the SQL layer or above, but cannot be retried by the span refresher).

So this is another way that lease transfers are disruptive for clients that don't implement savepoint retries. Is there anything we can do about it? Using the abort span for operations on transaction records (and not just intents) might be one solution. I'm not sure what performance impact that would have or how much garbage it would leave lying around.

@bdarnell
Contributor

bdarnell commented Mar 4, 2019

The logic here was touched by @nvanbenschoten's recent changes but the issue here is older than that

I take that back. The code around EndTransaction and the timestamp cache has been this way for a long time, but #33523 changed things so that PushTxn updates the write timestamp cache. I think without that we'd be safe to retry after a flushed timestamp cache from the TxnCoordSender. But with PushTxn using the timestamp cache to communicate with EndTxn, we have to treat this situation as a potential abort.

@andreimatei
Contributor Author

We could marshal the write timestamp cache around on lease transfers to avoid this disruption, right?

@bdarnell
Contributor

bdarnell commented Mar 4, 2019

Yes, that's one option. I don't know how big the write timestamp cache gets in practice. We could probably also filter out some parts of it (to only include entries where the txn id in the payload differs from the one in the key?).

@andreimatei
Contributor Author

Lazy question: which transactions exactly are subject to this disruption from the lease transfer? Is it only transactions that hadn't written their txn record yet at the moment the lease transfer happens, or is it all txns that were in-flight at the moment of the transfer? In other words, is the write timestamp cache consulted (by anything having to do with the txn record) even when there is a transaction record?
Because I'm thinking another option is to speed up the first heartbeat (which writes the txn record, I believe). Right now, I think we do it 1s after the transaction sends its first write, but we could do it more aggressively.

@bdarnell
Contributor

bdarnell commented Mar 4, 2019

It's consulted only for transactions that hadn't written their txn record yet. But the idea with the new changes is that txn records are only written by heartbeats and commits, and that's part of the speedup. We could make the first heartbeat more aggressive or we could go back to writing the txn record with BeginTxn to reduce the window for this issue, but that would run counter to the work @nvanbenschoten has been doing here.

@andreimatei
Contributor Author

Well, I don't know about "counter". I believe the main point of the change was to prepare for parallel commits, where you don't want anything to wait for the BeginTxn. The problem was that the BeginTxn was synchronous, not so much the extra txn record write (although getting rid of that write is also worth something). I think.
But yeah, I don't really want to argue for more aggressive txn record creation; it seems that would only lower the rate of such retries, not eliminate them. I'd rather argue for transferring the write timestamp cache (at least the txn records span).

If the wTsCache is too large, I think we can get away with just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark. And now I'm thinking - could we also keep the wTsCache small by removing entries from it once the client finds out that its txn has been aborted?

@nvanbenschoten
Member

or we could go back to writing the txn record with BeginTxn to reduce the window for this issue

Yes, as @bdarnell mentioned, this is how things worked for the past few years - 471811e. The write timestamp cache was updated on EndTransaction and consulted on BeginTransaction as a means of replay protection. So transactions always had a window of vulnerability during which lease transfers could cause a transaction abort. #33396 made it so that whichever request creates the transaction record is the one that performs this check. #33566 is what actually stopped sending BeginTransaction requests. This final change is what resulted in an expanded window of vulnerability for lease transfer-induced transaction aborts.

I think without that we'd be safe to retry after a flushed timestamp cache from the TxnCoordSender.

I'm curious to hear more about what you have in mind here. Are you saying that if we stopped using the write timestamp cache for PushTxn(ABORT), we could ignore possible replay errors as long as they made it back to the client? What do you mean by "flushed timestamp cache"? I need to think more about how that proposal works, especially with compound replays.

We could marshal the write timestamp cache around on lease transfers to avoid this disruption, right?

I like the idea of serializing portions of the timestamp cache and shipping them over lease exchanges. This really is an issue with a loss of information during lease transfers, so this seems like the natural place to solve it. I also wouldn't rule out that we might do the same thing for the read timestamp cache sometime in the future too, as the loss of resolution there can also cause retries that might bubble up to the client.

"If the wTsCache is too large, I think we can get away by just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark" is along the lines of what I was thinking. We would attempt to maintain full resolution in the new leaseholder's timestamp cache (over the txnID portion) within some max time window (1s?). This serialization could be as simple as:

txnKeys  []roachpb.Key // txn record keys carried over at full resolution
lowWater hlc.Timestamp // low-water mark applied to all other keys

We could read this straight out of the old leaseholder's timestamp cache.
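
A rough sketch of how such a summary might be produced and consumed, using toy local types in place of the real tscache/hlc packages (nothing below reflects actual CockroachDB code): the outgoing leaseholder ships full-resolution entries for recently touched txn-record keys plus a low-water mark, and the incoming leaseholder folds them into its cache rather than pessimistically ratcheting its low-water mark up to the lease start.

```go
package main

import "fmt"

type timestamp int64 // toy stand-in for hlc.Timestamp

// writeTSCacheSummary is the hypothetical payload shipped with the lease
// transfer: recent txn-record entries at full resolution plus a low-water
// mark covering everything older.
type writeTSCacheSummary struct {
	txnEntries map[string]timestamp // txn record key -> last write-ts-cache entry
	lowWater   timestamp            // applies to every key not listed above
}

type writeTSCache struct {
	entries  map[string]timestamp
	lowWater timestamp
}

// summarize runs on the outgoing leaseholder; only entries newer than the
// cutoff (e.g. the last ~1s) are kept so the summary stays small.
func (c *writeTSCache) summarize(cutoff timestamp) writeTSCacheSummary {
	s := writeTSCacheSummary{txnEntries: map[string]timestamp{}, lowWater: cutoff}
	if c.lowWater > s.lowWater {
		s.lowWater = c.lowWater
	}
	for k, ts := range c.entries {
		if ts >= cutoff {
			s.txnEntries[k] = ts
		}
	}
	return s
}

// apply runs on the incoming leaseholder, preserving resolution over the txn
// record keys instead of treating every record as possibly replayed.
func (c *writeTSCache) apply(s writeTSCacheSummary) {
	if s.lowWater > c.lowWater {
		c.lowWater = s.lowWater
	}
	for k, ts := range s.txnEntries {
		if ts > c.entries[k] {
			c.entries[k] = ts
		}
	}
}

func main() {
	outgoing := &writeTSCache{entries: map[string]timestamp{"/txn/abc": 105}, lowWater: 90}
	incoming := &writeTSCache{entries: map[string]timestamp{}}
	incoming.apply(outgoing.summarize(100))
	fmt.Println(incoming.entries, incoming.lowWater) // map[/txn/abc:105] 100
}
```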

And now I'm thinking - could we also keep the wTsCache small by removing entries from it once the client finds out that its txn has been aborted?

We could, although a structure like the timestamp cache is nice in that this GC is built in as it rolls over as time passes.

The problem was that the BeginTxn was synchronous, not so much the extra txn record write (although getting rid of that write is also worth something). I think.

Kind of. The problem was that BeginTxn stalled the EndTxn when acquiring latches because they touched the same key and didn't commute.

Also, FWIW I'm hoping to eliminate the existence of ABORTED transaction records entirely: see here. However, we might not get a lot from that if we still have persistent abort span entries to protect zombie transactions.

@bdarnell
Contributor

bdarnell commented Mar 6, 2019

I'm curious to hear more about what you have in mind here. Are you saying that if we stopped using the write timestamp cache for PushTxn(ABORT), we could ignore possible replay errors as long as they made it back to the client?

We wouldn't ignore the replay errors; we would treat them as retryable. My thinking is that if the write timestamp cache is only used for interactions between BeginTxn and EndTxn (and their possible replays), the TxnCoordSender knows enough about what operations it has sent to know when it can safely retry without the possibility of duplicating a successful transaction. (This is assuming that we can treat the POSSIBLE_REPLAY signal as unambiguous and that any duplication lower in the stack would cause it to be transformed into an AmbiguousResultError.)

What do you mean by "flushed timestamp cache"?

I mean the zeroing (or rather low-water-marking) of the timestamp cache that occurs when leases change hands.

I like the idea of serializing portions of the timestamp cache and shipping them over lease exchanges. This really is an issue with a loss of information during lease transfers, so this seems like the natural place to solve it. I also wouldn't rule out that we might do the same thing for the read timestamp cache sometime in the future too, as the loss of resolution there can also cause retries that might bubble up to the client.

Yeah, it seems like a good idea to make lease transfers less costly in general.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 17, 2019
Pipelined writes wait for commands to be evaluated, then detach before
they apply. Retryable errors generated at evaluation time are handled
correctly (by DistSender or above) even for pipelined writes, but
pipelined writes have lost the ability to handle apply-time retry
conditions in the executeWriteBatch loop (there used to be more of
these, but illegal lease index is now the only condition retried at
this level). To remedy this, this commit moves reproposals due to
illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft
leadership transfer. This is observable immediately after table
creation, since table creation performs a split, and then may perform
a lease transfer to balance load (and if the lease is transferred, raft
leadership is transferred afterward).

Specifically,
1. Lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2
forwards the proposal to s1.
3. s1 initiates raft leader transfer to s2. This puts it into a
temporarily leaderless state so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and
goes right into the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and
writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out of order lease
index.

If w1 was pipelined, the client stops listening after step 2, and
won't learn of the failure until it tries to commit. At this point the
commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply
time. These errors cannot be retried (since the proposer has lost its
lease) so they will unavoidably result in an ASYNC_WRITE_FAILURE error
to the client. However, this is uncommon - most NotLeaseHolderErrors
are generated at evaluation time, which is compatible with pipelined
writes.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE
transaction retry errors, especially for the first insert into a
newly-created table.
craig bot pushed a commit that referenced this issue Mar 19, 2019
35261: storage: Improve handling of lease index retries r=tbg,nvanbenschoten a=bdarnell

Pipelined writes wait for commands to be evaluated, then detach before
they apply. Retryable errors generated at evaluation time are handled
correctly (by DistSender or above) even for pipelined writes, but
pipelined writes have lost the ability to handle apply-time retry
conditions in the executeWriteBatch loop (there used to be more of
these, but illegal lease index is now the only condition retried at
this level). To remedy this, this commit moves reproposals due to
illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft
leadership transfer. This is observable immediately after table
creation, since table creation performs a split, and then may perform
a lease transfer to balance load (and if the lease is transferred, raft
leadership is transferred afterward).

Specifically,
1. Lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2
forwards the proposal to s1.
3. s1 initiates raft leader transfer to s2. This puts it into a
temporarily leaderless state so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and
goes right into the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and
writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out of order lease
index.

If w1 was pipelined, the client stops listening after step 2, and
won't learn of the failure until it tries to commit. At this point the
commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply
time. These errors cannot be retried (since the proposer has lost its
lease) so they will unavoidably result in an ASYNC_WRITE_FAILURE error
to the client. However, this is uncommon - most NotLeaseHolderErrors
are generated at evaluation time, which is compatible with pipelined
writes.
Fixes #28876

The fourth commit is the important part of this change. The fifth is optional; it's a simplification but it changes an under-tested error path. Assuming we like the fifth commit, I'd like to add a sixth that rips out proposalReevaluationReason and the executeWriteBatch/tryExecuteWriteBatch loop altogether. 

Co-authored-by: Ben Darnell <[email protected]>
@craig craig bot closed this as completed in #35261 Mar 19, 2019
@bdarnell
Contributor

#35261 fixed the major identified cause of async_write_failure retry errors, so I'm considering this complete. These errors should now only happen when there is a transaction conflict that wipes out one of our pending intents.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Mar 25, 2019
Pipelined writes wait for commands to be evaluated, then detach before
they apply. Retryable errors generated at evaluation time are handled
correctly (by DistSender or above) even for pipelined writes, but
pipelined writes have lost the ability to handle apply-time retry
conditions in the executeWriteBatch loop (there used to be more of
these, but illegal lease index is now the only condition retried at
this level). To remedy this, this commit moves reproposals due to
illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft
leadership transfer. This is observable immediately after table
creation, since table creation performs a split, and then may perform
a lease transfer to balance load (and if the lease is transferred, raft
leadership is transferred afterward).

Specifically,
1. Lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2
forwards the proposal to s1.
3. s1 initiates raft leader transfer to s2. This puts it into a
temporarily leaderless state so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and
goes right into the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and
writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out of order lease
index.

If w1 was pipelined, the client stops listening after step 2, and
won't learn of the failure until it tries to commit. At this point the
commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply
time. These errors cannot be retried (since the proposer has lost its
lease) so they will unavoidably result in an ASYNC_WRITE_FAILURE error
to the client. However, this is uncommon - most NotLeaseHolderErrors
are generated at evaluation time, which is compatible with pipelined
writes.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE
transaction retry errors, especially for the first insert into a
newly-created table.