kv, storage: reduce frequency of RETRY_ASYNC_WRITE_FAILURE failures #28876
And also I wanted to say - if indeed the reason why we get more of these errors now is that for pipelined writes there's no retry loop in the […]
Yeah, this is worth an investigation to at least determine what the cause of these retryable errors is, thanks for filing. For the record, the retry loop is still there for reproposals, just not for re-evaluations. I think we could try to revive some re-evaluation if we compared the result against the previous result, but that seems messy. A better approach is to make writes idempotent and re-issue them (possibly synchronously) if they are ever found to be missing.
It might be true that we can do this even without idempotency if we can prove that a missing write implies that that write will never succeed. In that case, we could transparently re-issue the write and continue without a client-side retry as long as the result is the same (not dissimilar to the refresh span optimization). The main downside is that we'd then need to hold on to the BatchRequest for all outstanding writes on the client.
Yeah, I meant the other way around here; re-evaluations are what we'd want here. So why exactly can't the retry loop (not the client) perform re-evaluations too?
Because once we leave the command queue, the re-evaluation might produce a different result.
Informs cockroachdb#28876. This change adds a new metric called `txn.restarts.asyncwritefailure`, along with a corresponding series to the admin ui's `KV Transaction Restarts` graph. These transaction restarts are caused by async consensus writes which fail while a transaction is attempting to pipeline writes. The restart occurs when the transaction visits the write's key while attempting to prove that the write succeeded and finds a missing intent. Release note: None
29499: kv/ui: add metric and graph for AsyncWriteFailure restarts r=nvanbenschoten a=nvanbenschoten Co-authored-by: Nathan VanBenschoten <[email protected]>
@timveil was hitting this reliably with a CREATE TABLE statement followed by immediate transactional writes to the table. I suspect that he was hitting a lease transfer on the new table's range, which was recently split off as a new range. We should try much harder to avoid these errors. The long term solution would be to reissue the failed writes and verify that their results were the same as before (because we're not protected by latches and an earlier write could have slipped in). Now that writes within a transaction are idempotent, this should be possible, although it will add a large amount of complexity and require additional bookkeeping in the transaction coordinator. In the meantime, we should try to avoid hitting these errors as often. The biggest offenders here are lease transfers. An async write will fail to retry on a […]
I don't think there's any deep reason; we just hadn't thought of this before. But I'm not sure I understand the situation. As soon as we transfer the lease away, don't we poison our lease so we won't use it any more? We are fairly quick to return NotLeaseHolderError because we expect that these errors will always be retried. Why aren't they retried here?
Yes, but we reject any in-flight proposals. The issue is that for asynchronous proposals, no one is listening for this rejection. Transactions only learn of these failures when they go back to query their intents.
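The "no one listening" situation can be illustrated with a small Go sketch (hypothetical types; the real proposal plumbing in CockroachDB is more involved):

```go
package main

import "fmt"

// proposalResult is the outcome of an async consensus write (illustrative type).
type proposalResult struct{ err error }

// propose models an apply-time rejection (e.g. due to a lease transfer). The
// client of a synchronous write waits on the done channel; a pipelined write
// detaches after evaluation, so nobody receives the rejection.
func propose(done chan proposalResult) {
	select {
	case done <- proposalResult{err: fmt.Errorf("rejected: lease transfer in progress")}:
		// A synchronous caller was waiting and can retry immediately.
	default:
		// No one is listening: the transaction only learns of the failure
		// later, when it queries its intent and finds it missing.
	}
}

func main() {
	// Pipelined write: the channel is unbuffered and nobody receives on it,
	// so the rejection is silently dropped.
	propose(make(chan proposalResult))
	fmt.Println("pipelined write detached; rejection was dropped")
}
```

The non-blocking send with `default` is what makes the failure invisible until the transaction later tries to prove the write succeeded.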
Ah, got it. So the effect of using a span latch here would be to delay the lease transfer until we flush out the existing writes, not to prevent new writes while the transfer is in progress. That would probably work, although I feel like the loss of automatic retries for these async writes may have other unfortunate consequences and we may want to try to figure out some way for them to be automatically retried by a new owner once they've detached from their original connection.
It would actually accomplish both. The latter is what I'm more worried about though, as any write proposed while a lease transfer is in progress is bound to fail.
We could be better about retrying without releasing latches if the leaseholder doesn't change (e.g. could we repropose on […])
The difference between raft leader and leaseholder strikes again. After the TransferLease command resolves and releases its latches, we immediately begin to transfer raft leadership. It's the raft leadership change that causes the apply-time error. We might be able to hang on to the latches during the raft transfer (or treat the raft transfer as a new pseudo-command that will get its own latches). Or as @nvanbenschoten's last post said, we could retry in this case as long as the leaseholder doesn't change (which should combine with the addition of full-range latches on lease transfers to be a complete solution).
We should be able to do that (I'm not quite clear why we don't - probably FUD about changing the MaxLeaseIndex of an existing proposal).
I think this may date back to some early speculation about propEvalKV, in which we planned to stack proposal batches on top of each other (instead, we rely on the span latches to ensure that proposals always commute). Or maybe it's something that was once necessary but other code changes have rendered moot. In any case, it looks to me now like we ought to be able to update the MaxLeaseIndex and handle everything within refreshProposalsLocked, which would be a nice simplification.
Aha, the reason behind the current design is that the primary source of […]
OK, paging back in the reasons for the […]. This is used for replay protection: we may propose the same command multiple times with the same MaxLeaseIndex. At most one of these can succeed (because a successful application sets LeaseAppliedIndex to its MaxLeaseIndex and all subsequent copies will fail with illegal lease indexes), and once we've seen one illegal lease index any remaining copies are guaranteed to fail. This allows us to retry safely on a lease index error as long as there hasn't been a prior success. The involvement of the caller (the executeWriteBatch loop) in the retry process is currently serving as an indirect test for whether we have seen a prior success (since that would cause the caller loop to exit). It looks like the presence of the proposal in […]
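The replay-protection invariant described here can be sketched in Go (a simplified model with illustrative names, not the actual CockroachDB code):

```go
package main

import "fmt"

// applyOutcome is the result of checking a proposal's lease index at apply time.
type applyOutcome int

const (
	apply                   applyOutcome = iota // lease index is legal; apply and advance
	rejectIllegalLeaseIndex                     // stale copy; safe to repropose if no copy succeeded
)

// checkLeaseIndex models the apply-time check: a proposal applies only if its
// MaxLeaseIndex is strictly greater than the range's LeaseAppliedIndex. A
// successful application then sets LeaseAppliedIndex to the proposal's
// MaxLeaseIndex, so at most one of several identical copies can ever succeed;
// once one copy hits an illegal lease index, all remaining copies must fail too.
func checkLeaseIndex(leaseAppliedIndex, maxLeaseIndex uint64) applyOutcome {
	if maxLeaseIndex <= leaseAppliedIndex {
		return rejectIllegalLeaseIndex
	}
	return apply
}

func main() {
	lai := uint64(10)
	// The first copy of a proposal with MaxLeaseIndex 11 applies...
	fmt.Println(checkLeaseIndex(lai, 11) == apply)
	lai = 11 // successful application advances LeaseAppliedIndex
	// ...and every later duplicate of the same proposal is rejected.
	fmt.Println(checkLeaseIndex(lai, 11) == rejectIllegalLeaseIndex)
}
```

This is why a lease index error is safe to retry as long as there has been no prior success: the rejection itself proves that this copy did not apply.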
I have a fix for this that I'm polishing up.
I've abandoned any span latch changes for lease operations. I think they're unnecessary (We already record that we've started a transfer and don't accept new writes until it finishes. Any writes that predate the transfer are mostly going to be fine; it's rare to get a retry error at that point). More importantly, they have significant performance risks for expiration-based leases - I wouldn't want to see the liveness range become unwriteable while a lease renewal is in progress. I'm going to skip this part unless and until we can demonstrate a more concrete need for it.
That's true, although lease operations don't wait for existing writes to finish evaluation or replication, so it's possible for them to jump ahead of writes. If we have re-ordering within Raft then we're already lost (mod discussion above) because the writes will hit a […]. Still, I think your intuition about not latching is correct given that we at least block future writes.
In a pipelined write, the client executeWriteBatch is detached early, so it may not be around to retry in the event of a lease index error. These errors need to be retried at a lower level (or else they will bubble up to the client). In practice, this means that writes that raced with a leadership transfer (especially the first insert into a newly-created table) were likely to generate ASYNC_WRITE_FAILURE retries without this change. Fixes cockroachdb#28876 Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE transaction retry errors, especially for the first insert into a newly-created table.
#35261 fixes the async write failures, but the test it adds is still flaky under stress (it fails a lot even without stress on teamcity, but locally it passes pretty reliably without stress). Now it gets […]
Ah, this is a TransactionAbortedError and not a TransactionRetryError. The logic here was touched by @nvanbenschoten's recent changes but the issue here is older than that (and mentioned in #33396 (review)). We're using the timestamp cache as the source of truth for certain transaction conflicts, but this means that when the tscache gets flushed on lease transfers we must assume the worst and treat these errors as aborts (which can be retried at the SQL layer or above, but cannot be retried by the span refresher). So this is another way that lease transfers are disruptive for clients that don't implement savepoint retries. Is there anything we can do about it? Using the abort span for operations on transaction records (and not just intents) might be one solution. I'm not sure what performance impact that would have or how much garbage it would leave lying around.
I take that back. The code around EndTransaction and the timestamp cache has been this way for a long time, but #33523 changed things so that PushTxn updates the write timestamp cache. I think without that we'd be safe to retry after a flushed timestamp cache from the TxnCoordSender. But with PushTxn using the timestamp cache to communicate with EndTxn, we have to treat this situation as a potential abort. |
We could marshal the write timestamp cache around on lease transfers to avoid this disruption, right?
Yes, that's one option. I don't know how big the write timestamp cache gets in practice. We could probably also filter out some parts of it (to only include entries where the txn id in the payload differs from the one in the key?)
Lazy question: which transactions exactly are subject to this disruption from the lease transfer? Is it only transactions that hadn't written their txn record yet at the moment the lease transfer happens, or is it all txns that were in-flight at the moment of the transfer? In other words, is the write timestamp cache consulted (by anything having to do with the txn record) even when there is a transaction record?
It's called only for transactions that hadn't written their txn record yet. But the idea with the new changes is that txn records are only written by heartbeats and commits, and that's part of the speedup. We could make the first heartbeat more aggressive or we could go back to writing the txn record with BeginTxn to reduce the window for this issue, but that would run counter to the work @nvanbenschoten has been doing here.
Well, I don't know about "counter". I believe the main point of the change was as a preparation for parallel commits, where you don't want anything to wait for the […]. If the wTsCache is too large, I think we can get away with just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark. And now I'm thinking - could we also keep the wTsCache small by removing entries from it once the client finds out that its txn has been aborted?
Yes, as @bdarnell mentioned, this is how things worked for the past few years - 471811e. The write timestamp cache was updated on […]
I'm curious to hear more about what you have in mind here. Are you saying that if we stopped using the write timestamp cache for PushTxn(ABORT), we could ignore possible replay errors as long as they made it back to the client? What do you mean by "flushed timestamp cache"? I need to think more about how that proposal works, especially with compound replays.
I like the idea of serializing portions of the timestamp cache and shipping them over lease exchanges. This really is an issue with a loss of information during lease transfers, so this seems like the natural place to solve it. I also wouldn't rule out that we might do the same thing for the read timestamp cache sometime in the future too, as the loss of resolution there can also cause retries that might bubble up to the client. "If the wTsCache is too large, I think we can get away by just copying over a portion of it corresponding to the most recently aborted txns, and a corresponding low-watermark" is along the lines of what I was thinking. We would attempt to maintain full resolution in the new leaseholder's timestamp cache (over the txnID portion) within some max time window (1s?). This serialization could be as simple as:
We could read this straight out of the old leaseholder's timestamp cache.
We could, although a structure like the timestamp cache is nice in that this GC is built in as it rolls over as time passes.
Kind of. The problem was that BeginTxn stalled the EndTxn when acquiring latches because they touched the same key and didn't commute. Also, FWIW I'm hoping to eliminate the existence of ABORTED transaction records entirely: see here. However, we might not get a lot from that if we still have persistent abort span entries to protect zombie transactions.
We wouldn't ignore the replay errors, we would treat them as retryable. My thinking is that if the write timestamp cache is only used for interactions between BeginTxn and EndTxn (and their possible replays), the TxnCoordSender knows enough about what operations it has sent to know when it can safely retry without the possibility of duplicating a successful transaction. (this is assuming that we can treat the POSSIBLE_REPLAY signal as unambiguous and any duplication lower in the stack would cause it to be transformed into an AmbiguousResultError)
I mean the zeroing (or rather low-water-marking) of the timestamp cache that occurs when leases change hands.
Yeah, it seems like a good idea to make lease transfers less costly in general.
Pipelined writes wait for commands to be evaluated, then detach before they apply. Retryable errors generated at evaluation time are handled correctly (by DistSender or above) even for pipelined writes, but pipelined writes have lost the ability to handle apply-time retry conditions in the executeWriteBatch loop (there used to be more of these, but illegal lease index is now the only condition retried at this level). To remedy this, this commit moves reproposals due to illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft leadership transfer. This is observable immediately after table creation, since table creation performs a split and then may perform a lease transfer to balance load (and if the lease is transferred, raft leadership is transferred afterward). Specifically:

1. The lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2 forwards the proposal to s1.
3. s1 initiates a raft leadership transfer to s2. This puts it into a temporarily leaderless state, so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and goes right into the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out-of-order lease index.

If w1 was pipelined, the client stops listening after step 2, and won't learn of the failure until it tries to commit. At this point the commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply time. These errors cannot be retried (since the proposer has lost its lease), so they will unavoidably result in an ASYNC_WRITE_FAILURE error to the client. However, this is uncommon - most NotLeaseHolderErrors are generated at evaluation time, which is compatible with pipelined writes.

Fixes cockroachdb#28876

Release note (bug fix): Reduced the occurrence of ASYNC_WRITE_FAILURE transaction retry errors, especially for the first insert into a newly-created table.
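The apply-time decision described in this commit message can be sketched as follows (a hypothetical simplification with illustrative names, not the actual CockroachDB code):

```go
package main

import "fmt"

// applyDecision models what happens when a command comes out of the raft log
// on the proposing node (illustrative type).
type applyDecision string

const (
	applied        applyDecision = "applied"
	reproposed     applyDecision = "reproposed"    // retried below raft, invisible to the client
	asyncWriteFail applyDecision = "async-failure" // later surfaces as an ASYNC_WRITE_FAILURE retry
)

// onApply: a command with a stale lease index is reproposed below raft on the
// proposing node as long as it still holds the lease; otherwise the failure
// eventually becomes visible to the (possibly detached) pipelined client.
func onApply(leaseAppliedIndex, maxLeaseIndex uint64, isProposer, stillLeaseholder bool) applyDecision {
	if maxLeaseIndex > leaseAppliedIndex {
		return applied
	}
	if isProposer && stillLeaseholder {
		return reproposed // assign a fresh MaxLeaseIndex and resubmit to raft
	}
	return asyncWriteFail
}

func main() {
	// w1 from the scenario: applied after w2, so its lease index i1 is stale.
	// Because s2 is the proposer and still holds the lease, it reproposes.
	fmt.Println(onApply(2, 1, true, true))
	// If the proposer had lost its lease, the failure could not be retried.
	fmt.Println(onApply(2, 1, true, false))
}
```

The key point is that the repropose path no longer depends on a client waiting in the executeWriteBatch loop, which is exactly what a pipelined write lacks.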
35261: storage: Improve handling of lease index retries r=tbg,nvanbenschoten a=bdarnell Fixes #28876. The fourth commit is the important part of this change. The fifth is optional; it's a simplification but it changes an under-tested error path. Assuming we like the fifth commit, I'd like to add a sixth that rips out proposalReevaluationReason and the executeWriteBatch/tryExecuteWriteBatch loop altogether. Co-authored-by: Ben Darnell <[email protected]>
#35261 fixed the major identified cause of async_write_failure retry errors, so I'm considering this complete. These errors should now only happen when there is a transaction conflict that wipes out one of our pending intents.
@nvanbenschoten, the `RETRY_ASYNC_WRITE_FAILURE` error I was telling you about is very easy to repro on master, so I'm thinking maybe it's worth verifying that it indeed happens because of a lease transfer. Maybe there's something else going on too (based on nothing but the frequency). I'm running something like `bin/roachtest run interleavedpartitioned --count 100`, reusing a cluster (so `--cluster <name>`), and more often than not the test fails within 20s or so. Looks like if it doesn't fail in the beginning, it won't fail later on either. I've also changed the duration below to `30s` to speed up the process: `cockroach/pkg/cmd/roachtest/interleavedpartitioned.go`, line 62 in 2e905a0. What do you think? Feel free to close or pass back if you're not interested. I've separately opened #28873 complaining that the test doesn't do retries.