
RequestTimeoutOverhead appears to override Context cancellation #769

Closed · JFlath opened this issue Jun 14, 2024 · 5 comments

JFlath commented Jun 14, 2024

In the example below, we see a produce request fail (due to a broker restart) and repeated attempts to connect to that broker with the default RequestTimeoutOverhead of 10s:

2024-06-04T12:55:45.348+0100    DEBUG   producer_client kzap/kzap.go:110        retry batches processed {"wanted_metadata_update": true, "triggering_metadata_update": true, "should_backoff": false}
2024-06-04T12:55:45.348+0100    INFO    producer_client kzap/kzap.go:114        metadata update triggered       {"why": "failed produce request triggered metadata update"}
2024-06-04T12:55:46.114+0100    DEBUG   producer_client kzap/kzap.go:110        opening connection to broker    {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1"}
2024-06-04T12:55:46.115+0100    WARN    producer_client kzap/kzap.go:116        unable to open connection to broker     {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1", "err": "dial tcp 10.XX.XX.XX:9092: connect: connection refused"}

2024-06-04T12:55:46.343+0100    DEBUG   producer_client kzap/kzap.go:110        opening connection to broker    {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1"}
2024-06-04T12:55:56.352+0100    WARN    producer_client kzap/kzap.go:116        unable to open connection to broker     {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1", "err": "dial tcp 10.XX.XX.XX:9092: i/o timeout"}

2024-06-04T12:55:56.352+0100    DEBUG   producer_client kzap/kzap.go:110        opening connection to broker    {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1"}
2024-06-04T12:56:05.717+0100    DEBUG   producer_client kzap/kzap.go:110        reaped connections      {"time_since_last_reap": "20.009282863s", "reap_dur": "127.7µs", "num_reaped": 4}
2024-06-04T12:56:06.352+0100    WARN    producer_client kzap/kzap.go:116        unable to open connection to broker     {"addr": "redpanda-1.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "1", "err": "dial tcp 10.XX.XX.XX:9092: i/o timeout"}

2024-06-04T12:56:06.352+0100    DEBUG   producer_client kzap/kzap.go:110        opening connection to broker    {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}
2024-06-04T12:56:06.353+0100    DEBUG   producer_client kzap/kzap.go:110        connection opened to broker     {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}
2024-06-04T12:56:06.353+0100    DEBUG   producer_client kzap/kzap.go:110        connection initialized successfully     {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}
2024-06-04T12:56:06.353+0100    DEBUG   producer_client kzap/kzap.go:110        wrote Metadata v7       {"broker": "2", "bytes_written": 34, "write_wait": "884.161µs", "time_to_write": "18.209µs", "err": null}
2024-06-04T12:56:06.354+0100    DEBUG   producer_client kzap/kzap.go:110        read Metadata v7        {"broker": "2", "bytes_read": 6694, "read_wait": "42.38µs", "time_to_read": "763.691µs", "err": null}

2024-06-04T12:56:06.355+0100    DEBUG   producer_client kzap/kzap.go:110        metadata refresh topic partition data changed   {"topic": "test_topic", "partition": 122, "new_leader": 2, "new_leader_epoch": 6, "old_leader": 1, "old_leader_epoch": 5}
2024-06-04T12:56:06.355+0100    DEBUG   producer_client kzap/kzap.go:110        opening connection to broker    {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}
2024-06-04T12:56:06.355+0100    DEBUG   producer_client kzap/kzap.go:110        connection opened to broker     {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}
2024-06-04T12:56:06.355+0100    DEBUG   producer_client kzap/kzap.go:110        connection initialized successfully     {"addr": "redpanda-2.redpanda.my_cluster.svc.cluster.local.:9092", "broker": "2"}

2024-06-04T12:56:06.355+0100    DEBUG   producer_client kzap/kzap.go:110        wrote Produce v7        {"broker": "2", "bytes_written": 131, "write_wait": "595.04µs", "time_to_write": "13.61µs", "err": null}
2024-06-04T12:56:06.357+0100    DEBUG   producer_client kzap/kzap.go:110        read Produce v7 {"broker": "2", "bytes_read": 62, "read_wait": "35.33µs", "time_to_read": "1.56906ms", "err": null}
2024-06-04T12:56:06.357+0100    DEBUG   producer_client kzap/kzap.go:110        produced        {"broker": "2", "to": "test_topic[122{1206=>1207}]"}
2024-06-04T12:56:06.357+0100    INFO    app     rb_produce_test/main.go:78      OK      {"partition": 122, "offset": 1206, "duration": "22.251886295s", "slow": true}

However, the request was made with a Context carrying a 1s timeout:

var timeout = 1 * time.Second

ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel() // release the timeout's resources once the produce returns

record := kgo.Record{
	Value: []byte("payload"),
	Key:   []byte("kev"),
	Topic: "test_topic",
}

_, err := c.ProduceSync(ctx, &record).First()

It looks like if we're in a connection retry loop, we might be waiting for that loop to exit before processing the Context cancellation?

Full logs and example code shared out of band.
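For readers without access to that, here is a rough, self-contained approximation of the setup described above (the seed broker address is a placeholder, and RequestTimeoutOverhead is the 10s default written out explicitly; this is a sketch, not the exact out-of-band code):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Placeholder seed broker; the real test ran against a Redpanda cluster.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("redpanda-0.redpanda.my_cluster.svc.cluster.local.:9092"),
		kgo.RequestTimeoutOverhead(10*time.Second), // the default, spelled out
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	// Produce with a 1s context. In the logs above, the call only returns
	// after ~22s, once the connection retries to the restarted broker finish.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	record := kgo.Record{
		Value: []byte("payload"),
		Key:   []byte("kev"),
		Topic: "test_topic",
	}

	start := time.Now()
	_, err = cl.ProduceSync(ctx, &record).First()
	log.Printf("err=%v duration=%s", err, time.Since(start))
}
```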

@ericmanlol

https://redpandadata.slack.com/archives/C03ALBZ276U/p1718371892420319?thread_ts=1717421131.859379&cid=C03ALBZ276U

I'm (and I'm sure others are) unable to view this at all without a user account; it's prompting me to log in. What is it?

JFlath (Author) commented Jul 12, 2024

@ericmanlol Yeah, I'm afraid that's intentional. I appreciate it's not really good form to put links that can't be followed publicly in a public issue, but sadly GitHub doesn't have a great solution for sharing data that's relevant to a public issue but isn't suitable for sharing publicly.

In this instance, it's a private conversation discussing the context in which this issue was seen in the wild; that context is relevant to the maintainer but has no impact on the technical aspects of the issue.

twmb (Owner) commented Jul 18, 2024

Context cancellation for records is inspected before a produce request is sent OR after a produce request is sent. Only the current "head" record in a partition is inspected -- that is, the first record in the batch that is being written.

You can see the context inspected here in maybeFailErr:

franz-go/pkg/kgo/sink.go

Lines 1423 to 1427 in a5f2b71

if len(b.records) > 0 {
ctx := b.records[0].ctx
select {
case <-ctx.Done():
return ctx.Err()

You can see that maybeFailErr is checked as a request is being written (before being sent) here:

franz-go/pkg/kgo/sink.go

Lines 1633 to 1635 in a5f2b71

if recBuf.batches[0] == batch {
if !p.idempotent() || batch.canFailFromLoadErrs {
if err := batch.maybeFailErr(&batch.owner.cl.cfg); err != nil {
-- note this only applies if the batch can fail.

You can see maybeFailErr checked after request failure here:

franz-go/pkg/kgo/sink.go

Lines 945 to 946 in a5f2b71

if canFail || s.cl.cfg.disableIdempotency {
if err := batch.maybeFailErr(&s.cl.cfg); err != nil {

It is checked in one other location which isn't relevant for this issue.
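Put together, the checks those excerpts point at boil down to something like the following (an illustrative sketch with simplified names, not the exact franz-go internals):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// maybeFailFromHeadCtx is a simplified paraphrase of the checks above: only
// the head (first) record's context in a batch is consulted, and only when
// the batch is currently allowed to fail.
func maybeFailFromHeadCtx(headCtx context.Context, canFail bool) error {
	if !canFail {
		return nil
	}
	select {
	case <-headCtx.Done():
		return headCtx.Err()
	default:
		return nil
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	time.Sleep(5 * time.Millisecond) // let the deadline expire

	fmt.Println(maybeFailFromHeadCtx(ctx, true))  // context deadline exceeded
	fmt.Println(maybeFailFromHeadCtx(ctx, false)) // <nil>: batch not allowed to fail
}
```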

The problem that is happening here is actually not in the logs in the issue report, but in logs that come a bit earlier:

2024-06-04T11:55:45.114Z        DEBUG   producer_client kzap/kzap.go:110        wrote Produce v7        {"broker": "1", "bytes_written": 131, "write_wait": "16.95µs", "time_to_write": "19.28µs", "err": null}
2024-06-04T11:55:45.114Z        DEBUG   producer_client kzap/kzap.go:110        read Produce v7 {"broker": "1", "bytes_read": 0, "read_wait": "50.59µs", "time_to_read": "3.19µs", "err": "EOF"}
2024-06-04T11:55:45.114Z        DEBUG   producer_client kzap/kzap.go:110        read from broker errored, killing connection    {"addr": "redpanda-1.redpanda.levente.svc.cluster.local.:9092", "broker": "1", "successful_reads": 7651, "err": "EOF"}

At this point, the client has written a produce request but has NOT received a response. The client cannot tell either way whether the broker actually received and processed the request (and the response was lost) or never received the request at all.

One key thing to note is that if you are producing with idempotency configured, then every record produced has a sequence number that must be one higher than the prior sequence number. The only way to reset sequence numbers is if you get a new producer ID or if you bump the epoch being used for the current producer ID.

There are two scenarios:

  • The broker received and processed the request, but the response was lost. Let's say the client allowed the records to be failed (i.e., what is being requested in this issue). When the buffered records are failed, the sequence number is reset to the last known produced sequence number. The next time you produce, the client will re-use a sequence number that is actually already on the broker (because the request was processed!). The client will then receive an OutOfOrderSequenceNumber error. This error is actually used to indicate data loss occurred. There is no way for the client to know the actual sequence number it should be producing at, or whether any data was lost (tbh there could be improvements within Kafka here to actually indicate the status of things better). The only thing the client can do is bump the producer epoch and reset sequence numbers internally. Unfortunately, the producer ID and sequence numbers exist to prevent duplicates, so the very process of resetting means we have a chance of allowing duplicate data.

  • The broker did not receive the request. In this case, if we failed the records, everything would work perfectly and behave as you'd like.

Unfortunately, we can't assume the latter case, so I've taken the pessimistic view: if a produce request is written but no response is received, no partition in that request can have its records failed.


That said, before I looked into the logs more and actually came to understand the issue, I assumed this was due to the context being canceled before a producer ID was received, and that the producer ID request was repeatedly failing, so I also went ahead and implemented the possibility to fail due to context cancellation in one more location. I can push that.
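As an aside on the trade-off above: the check quoted earlier (`s.cl.cfg.disableIdempotency`) means that, when failing fast on context cancellation matters more than idempotency for a given workload, disabling idempotent writes makes batches failable in those checks. A minimal sketch, assuming that trade-off is acceptable (broker address is a placeholder):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// With idempotent writes disabled, the canFail / disableIdempotency
	// paths quoted above apply, so a canceled record context can fail the
	// batch in those checks. This forfeits the idempotency guarantees
	// discussed above.
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder
		kgo.DisableIdempotentWrite(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err = cl.ProduceSync(ctx, &kgo.Record{Topic: "test_topic", Value: []byte("payload")}).FirstErr()
	log.Printf("produce err: %v", err)
}
```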

twmb added a commit that referenced this issue Jul 18, 2024
If a record's context is canceled, we now allow it to be failed in two
more locations:

* while the producer ID is loading -- we can actually now cancel the
  producer ID loading request (which may also benefit people using
  transactions that want to force quit the client)

* while a sink is backing off due to request failures

For people using transactions, canceling a context now allows you to
force quit in more areas, but the same caveat applies: your client will
likely end up in an invalid transactional state and be unable to
continue.

For #769.
twmb added a commit that referenced this issue Jul 20, 2024
twmb added a commit that referenced this issue Jul 29, 2024
twmb added a commit that referenced this issue Jul 29, 2024
twmb (Owner) commented Jul 29, 2024

Closing due to the above explanation.

twmb closed this as completed Jul 29, 2024