Retry send only for retriable exceptions #194

Closed

Conversation

@praseodym commented Jun 13, 2018

Fixes #193
Fixes #190
Fixes #178
Fixes #172

Current situation

Infinite retry for permanent failure (RecordTooLargeException):

[2018-06-13T22:01:44,766][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:44,868][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}
[2018-06-13T22:01:44,869][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:44,971][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}
[2018-06-13T22:01:44,971][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:45,073][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}

New situation

Records failing with RecordTooLargeException are now dropped and logged:

[2018-06-13T22:34:00,398][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, dropping record {:exception=>org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept., :event=>"{\"message\":\"something that is larger than max.message.bytes\"@timestamp\":\"2018-06-13T20:34:00.116Z\",\"@version\":\"1\",\"host\":\"hostname\",\"sequence\":0}"}

Transient failures, like a broker restart, are still handled just fine (single server cluster in this case):

[2018-06-13T22:40:31,204][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Connection to node 0 could not be established. Broker may not be available.
[2018-06-13T22:40:31,214][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Connection to node 0 could not be established. Broker may not be available.
[2018-06-13T22:40:31,392][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Error while fetching metadata with correlation id 7 : {test=LEADER_NOT_AVAILABLE}
[2018-06-13T22:40:31,398][WARN ][org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=producer-1] Received unknown topic or partition error in produce request on partition test-0. The topic/partition may not exist or the user may not have Describe access to it
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
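
The gist of the change, as a rough sketch rather than the exact diff (the method name is illustrative; logger and the batch/retry bookkeeping come from the surrounding plugin code, and the Kafka client classes are available on the JRuby classpath):

# Rough sketch only: decide per failed send future whether to retry or drop.
def handle_send_failure(exception)
  # KafkaProducer futures wrap the real error in a java.util.concurrent.ExecutionException.
  cause = exception.is_a?(java.util.concurrent.ExecutionException) ? exception.cause : exception
  if cause.is_a?(org.apache.kafka.common.errors.RetriableException)
    logger.info("KafkaProducer.send() future failed, will retry", :exception => cause)
    true   # keep the record in the batch for the next retry pass
  else
    logger.warn("KafkaProducer.send() future failed, dropping record", :exception => cause)
    false  # remove the record from the batch
  end
end

RecordTooLargeException does not extend RetriableException, so it ends up in the drop branch, while something like UnknownTopicOrPartitionException during a broker restart does and keeps the existing retry/backoff behaviour.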

@praseodym (Author)

The CI failure looks like a flake.

@robbavey (Contributor) left a comment

@praseodym Thanks for the contribution.

On the whole, I like the change: there is no reason to keep retrying messages that will never be sent successfully. I do have some concerns, though, about changing the default behavior to drop events over what could be a misconfiguration or miscalculation of max_request_size, for example. Without DLQ support, we don't really have a great story here.

cc @jsvd for his opinion

@praseodym (Author)

Agreed, having a DLQ would be the best solution for this. As the second-best option, in my opinion, I chose to log the entire Logstash event (the Kafka record value). This way it is at least possible to debug what kind of messages are being dropped.
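
Roughly what that drop path looks like (a sketch with illustrative names; cause is the non-retriable exception and record stands for the ProducerRecord built from the event, so record.value is the serialized payload):

# Sketch: without a DLQ, write the full serialized event to the log so
# dropped messages can at least be inspected after the fact.
logger.warn("KafkaProducer.send() future failed, dropping record",
            :exception => cause,
            :event => record.value)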

@praseodym (Author)

ping @robbavey @jsvd; any comments on this?

@xraystyle

Hey everyone, was there ever any movement on this? It seems I've hit this issue in production, and it becomes a show-stopping problem if enough of these messages build up and keep retrying.

@praseodym (Author)

This PR still works on the master branch. If there’s anything holding back a merge of this PR, please let me know.

@kreiger commented Nov 27, 2018

Hey, thanks for this PR; it unblocked our pipeline, which was stuck on RecordTooLargeExceptions. Looking forward to seeing it merged.

@praseodym force-pushed the retry-only-retriable branch from 8d25cde to da74af9 on January 6, 2019 14:00
@praseodym (Author)

@robbavey @jsvd can we get this merged? It has been over half a year and I'm not sure what we're waiting for.

@praseodym (Author)

After rebasing, I noticed that the CI failure was caused by an actual problem in the spec (not in the actual behaviour). This is fixed in a new commit.

I also noticed that retrying_send would sometimes retry the wrong message because it uses incorrect array indexing. This problem is present in current releases as well. It is fixed in another new commit in this PR.
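
A toy, self-contained illustration of that indexing problem (names and values are made up; in the plugin, futures[i] has to stay paired with batch[i]):

batch   = ["event-0", "event-1", "event-2"]
futures = [:ok, nil, :failed]   # nil where send() itself raised, :failed where the future failed

# Buggy: compacting shifts positions, so the failure at original index 2
# ends up paired with batch[1] instead of batch[2].
wrong = []
futures.compact.each_with_index do |future, i|
  wrong << batch[i] if future == :failed
end
# wrong == ["event-1"]  -- the wrong event would be retried

# Fixed: keep the original positions and skip the nil entries instead.
right = []
futures.each_with_index do |future, i|
  next if future.nil?
  right << batch[i] if future == :failed
end
# right == ["event-2"]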

@robbavey (Contributor) commented Feb 8, 2019

@jsvd Are you good to go with this after the most recent push?

@praseodym force-pushed the retry-only-retriable branch from 4714fd4 to d746376 on February 15, 2019 14:29
Nil values were removed from the futures array before looping, causing wrong indexes relative to the batch array.
@praseodym force-pushed the retry-only-retriable branch from cd63a07 to 2c72ced on March 9, 2019 22:00
@praseodym (Author)

Apparently I missed two specs, now fixed.

@Pectojin

Is there any reason this isn't getting merged?

I don't really know what to do with a plugin that will DoS itself to death... It seems pretty critical.

@praseodym (Author)

I'm not sure; maybe @robbavey or @jsvd have comments on this?

@praseodym (Author)

Closing in favour of logstash-plugins/logstash-integration-kafka#29.
