Async producer retries for failed messages #331
Conversation
There are several exceptions that could be raised within the send_produce_request() call chain besides FailedPayloadsError.
Thanks a lot for the clarifications; it seems we need smarter retrying logic, I'll think about it. So far I've updated the code a little bit to retry all failed requests without exclusions.
I don't think we can do blanket retries. Many of the errors require resyncing cluster metadata before retry. The error handling logic should handle that.
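A sketch of the distinction being drawn here, assuming the error classes exposed by kafka.common in this era of the library (the exact split below is illustrative, not the merged policy):

    from kafka.common import (FailedPayloadsError,
                              LeaderNotAvailableError,
                              NotLeaderForPartitionError,
                              RequestTimedOutError)

    # Some failures are safe to retry as-is; others signal stale cluster
    # metadata that must be resynced before any retry can succeed.
    RETRIABLE = (FailedPayloadsError, RequestTimedOutError)
    REFRESH_FIRST = (LeaderNotAvailableError, NotLeaderForPartitionError)

    def classify(client, exc):
        if isinstance(exc, REFRESH_FIRST):
            client.load_metadata_for_topics()  # resync, then retry
            return 'retry'
        if isinstance(exc, RETRIABLE):
            return 'retry'
        return 'fail'  # blanket retries would mask real errors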
force-pushed from bc8965a to 8f45421
Sorry, removed/restored the branch by accident, I'm on this issue.
force-pushed from 400f10e to 59faeca
@dpkp I'm still getting some odd errors from Travis, but could you check the approach, please? I'm not sure if we're fine with having a namedtuple as an argument for a producer.
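For context, the namedtuple argument under discussion could look like this (a sketch only; the field names follow the commit summary further down and the producer class name is assumed for illustration):

    from collections import namedtuple

    # Hypothetical shape of the retry-options argument being questioned.
    RetryOptions = namedtuple(
        "RetryOptions", ["limit", "backoff_ms", "retry_on_timeouts"])

    # `client` and the producer signature are assumptions, not the final API.
    producer = Producer(client,
                        retry_options=RetryOptions(limit=5,
                                                   backoff_ms=300,
                                                   retry_on_timeouts=True))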
    except Exception:
        log.exception("Unable to send message")

    except FailedPayloadsError as ex:
now that #366 is merged, we can pass fail_on_error=False to client.send_produce_request and get a list of responses/errors. This would allow us to loop through the responses and only retry the reqs that failed. Currently this exception would cause all reqs in the batch to retry, even if only a single req caused the error.
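A sketch of that per-request loop, assuming (as after #366) that fail_on_error=False yields a list mixing successful responses with error instances, and that a failed-payload error carries the offending request on a payload attribute:

    from kafka.common import FailedPayloadsError

    responses = client.send_produce_request(requests, fail_on_error=False)

    # Collect only the requests that actually failed, instead of
    # re-sending the entire batch.
    to_retry = [r.payload for r in responses
                if isinstance(r, FailedPayloadsError)]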
Ok, thanks for pointing me to it, I missed that PR!
@dpkp Thanks for all your comments, they're extremely helpful, I'll try to fix everything asap.
- send_produce_request with fail_on_error=False to retry only the failed reqs
- use an internal dict with namedtuple keys for retry counters
- refresh metadata on refresh_error irrespective of retries options
- removed infinite retries (retry_options.limit=None) as an over-feature
- separate producer init args for retries options (limit, backoff, on_timeouts)
- AsyncProducerQueueFull returns a list of failed messages
- producer tests improved thanks to @rogaha and @toli
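The "internal dict with namedtuple keys" works because namedtuples are hashable; a minimal sketch of the retry counters (the key type and limit handling are assumptions, not the merged code):

    from collections import defaultdict, namedtuple

    # Any namedtuple that identifies a request can index a plain dict.
    TopicAndPartition = namedtuple("TopicAndPartition",
                                   ["topic", "partition"])

    retry_counts = defaultdict(int)

    def should_retry(req, limit=5):
        key = TopicAndPartition(req.topic, req.partition)
        retry_counts[key] += 1
        return retry_counts[key] <= limit  # drop once the limit is hit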
force-pushed from 93ef46d to 4474a50
Rebased and prepared a draft for
force-pushed from 8916dbc to 7d6f3f5
awesome! will take a look shortly!
There are a handful of issues remaining -- we need to re-enable the test_switch_leader_async test in test/test_failover_integration.py and then fix it, along with the lint issues failing the Travis build. But you've put in a ton of work and I think it is basically there, so I am going to merge and push a few fixes directly. Thanks so much for working on this one!
Cool, thank you @dpkp! If I have some additional time, I'll fix the other issues (like additional tests and docstrings) as well in a separate PR.
Thanks again for your help on this. I put together some edits in #388 (merged yesterday).
Another approach to add retries for failed messages (for asynchronous mode only).

- Changed ProduceRequest a little bit by adding a retries count to its fields (sketched below). It's backward compatible: you can omit this param when creating the initial ProduceRequest, it's 0 initially. (It's not the cleanest solution in terms of code, but storing a retries count per request sounds reasonable.)
- batch_retry_backoff_ms - backoff timeout to wait before the next retry (300 by default);
- batch_retries_limit - upper limit of retries per request (5 by default, None means infinite retries, 0 means no retries).
- Catching FailedPayloadsException from "Improve fault tolerance by handling leadership election and other metadata changes" #55 to get the failed requests and increment retries for them, or drop them depending on settings.

What do you think about the approach? I'm going to add some additional tests for it asap.
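A sketch of the backward-compatible ProduceRequest change described above; the default-value idiom is an assumption about the wiring, not a quote of the PR:

    from collections import namedtuple

    # Original fields plus a trailing `retries` counter; defaulting it
    # to 0 keeps existing three-argument call sites working unchanged.
    ProduceRequest = namedtuple(
        "ProduceRequest", ["topic", "partition", "messages", "retries"])
    ProduceRequest.__new__.__defaults__ = (0,)  # only `retries` defaults

    req = ProduceRequest("my-topic", 0, [b"payload"])
    assert req.retries == 0
    req = req._replace(retries=req.retries + 1)  # bumped on a failed send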