RequestTimeoutOverhead appears to override Context cancellation #769
I (and I'm sure others) am unable to view this at all without a user account; it's prompting me to log in. What is it?
@ericmanlol Yeah, I'm afraid that's intentional. I appreciate it's not really good form to put non-public-followable links in a public issue, but sadly GitHub doesn't have a great solution for sharing data that's relevant to a public issue but which isn't suitable for sharing publicly. In this instance, it's a private conversation discussing the context in which this issue was seen in the wild; that context is relevant to the maintainer but doesn't have an impact on the technical aspects of the issue.
Context cancellation for records is inspected before a produce request is sent OR after a produce request is sent. Only the current "head" record in a partition is inspected -- that is, the first record in the batch that is being written. You can see the context inspected at lines 1423 to 1427 in a5f2b71.
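As a minimal sketch of how that surfaces to a caller (broker address and topic name are placeholders, not from this issue): a per-record context is supplied on `Produce`, and per the above it is only consulted for the partition's head record at those two points.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092")) // placeholder broker
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	// Each record carries the context passed to Produce. Per the explanation
	// above, the client only consults the context of the partition's "head"
	// record, before a produce request is written or after a response is handled.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	cl.Produce(ctx, &kgo.Record{Topic: "events", Value: []byte("v")}, func(_ *kgo.Record, err error) {
		fmt.Println("promise:", err) // reports the cancellation if the deadline wins
	})
	cl.Flush(context.Background())
}
```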
The context is also checked at lines 1633 to 1635 and at lines 945 to 946 in a5f2b71.
It is checked in one other location which isn't relevant for this issue. The problem that is happening here is actually not in the logs in the issue report, but in logs that come a bit earlier.
At this point, the client has written a produce request but has NOT received a response. The client cannot assume either way whether the broker actually received and processed the request (and the response was lost) or whether the broker never received the request at all.

One key thing to note is that if you are producing with idempotency configured, then every record produced has a sequence number that must be one higher than the prior sequence number. The only way to reset sequence numbers is to get a new producer ID or to bump the epoch being used for the current producer ID. There are two scenarios: either the broker received and processed the request (so its sequence numbers have been consumed and later records must follow them), or the broker never received the request at all.
Unfortunately, we can't assume the latter case, so I've implemented the pessimistic view that produce requests that are written but do not receive a response prevent any partitions in that produce request from having their records failed. That said, before I looked into the logs more and actually understood the issue, I assumed this was due to the context being canceled before a producer ID was received, and that the producer ID request was repeatedly failing, so I also went ahead and implemented the possibility to fail due to context cancellation in one more location. I can push that.
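One possible mitigation, sketched here as an assumption rather than anything stated in this issue: if duplicate records on retry are acceptable for your workload, disabling idempotent writes removes the sequence-number constraint entirely.

```go
package main

import (
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Sketch: without idempotency there are no producer sequence numbers to
	// protect, so the unanswered-request concern above does not apply, at the
	// cost of possible duplicate records on retry. Broker is a placeholder.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.DisableIdempotentWrite(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()
}
```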
If a record's context is canceled, we now allow it to be failed in two more locations:

* while the producer ID is loading -- we can actually now cancel the producer ID loading request (which may also benefit people using transactions that want to force quit the client)
* while a sink is backing off due to request failures

For people using transactions, canceling a context now allows you to force quit in more areas, but the same caveat applies: your client will likely end up in an invalid transactional state and be unable to continue. For #769.
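To illustrate the transactional caveat, here is a rough sketch (transactional ID, broker address, and topic are placeholders): cancellation can now force the client out of these waits, but the transaction should then be treated as unrecoverable.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),  // placeholder broker
		kgo.TransactionalID("example-txn"), // placeholder transactional ID
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	if err := cl.BeginTransaction(); err != nil {
		log.Fatal(err)
	}

	// A short per-produce context can now also fail the record while the
	// producer ID loads or while a sink backs off after request failures.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	res := cl.ProduceSync(ctx, &kgo.Record{Topic: "events", Value: []byte("v")})
	if err := res.FirstErr(); err != nil {
		// Caveat from above: after a forced quit the transactional state is
		// likely invalid; treat this client as done rather than continuing.
		log.Printf("produce failed: %v", err)
	}

	endCtx, endCancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer endCancel()
	if err := cl.EndTransaction(endCtx, kgo.TryAbort); err != nil {
		log.Printf("end transaction: %v", err)
	}
}
```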
Closing due to the above explanation.
In the below example we see a produce request fail (due to a broker restart) and repeated attempts to connect to that broker with the default `RequestTimeoutOverhead` of 10s. However, the request was made with a `Context` passed down with a timeout of 1s. It looks like if we're in a connection retry loop, we might be waiting for that loop to exit before processing the `Context` cancellation?

Full logs and example code shared out of band.
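For reference, a minimal sketch of the setup described (broker address and topic are placeholders; the 10s `RequestTimeoutOverhead` default is spelled out explicitly):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		// The default overhead, written out explicitly: the report above is
		// that connection retries appear to honor this 10s rather than the
		// 1s context on the produce call below.
		kgo.RequestTimeoutOverhead(10*time.Second),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	res := cl.ProduceSync(ctx, &kgo.Record{Topic: "events", Value: []byte("v")})
	fmt.Println("produce result:", res.FirstErr())
}
```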