Retry send only for retriable exceptions #194

Closed

Conversation

@praseodym commented Jun 13, 2018

Fixes #193
Fixes #190
Fixes #178
Fixes #172

Current situation

Infinite retry for permanent failure (RecordTooLargeException):

[2018-06-13T22:01:44,766][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:44,868][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}
[2018-06-13T22:01:44,869][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:44,971][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}
[2018-06-13T22:01:44,971][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:01:45,073][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() failed: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept. {:exception=>java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.}

New situation

Records failing with RecordTooLargeException are now dropped and logged:

[2018-06-13T22:34:00,398][WARN ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, dropping record {:exception=>org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept., :event=>"{\"message\":\"something that is larger than max.message.bytes\"@timestamp\":\"2018-06-13T20:34:00.116Z\",\"@version\":\"1\",\"host\":\"hostname\",\"sequence\":0}"}

Transient failures, like a broker restart, are still handled just fine (single server cluster in this case):

[2018-06-13T22:40:31,204][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Connection to node 0 could not be established. Broker may not be available.
[2018-06-13T22:40:31,214][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Connection to node 0 could not be established. Broker may not be available.
[2018-06-13T22:40:31,392][WARN ][org.apache.kafka.clients.NetworkClient] [Producer clientId=producer-1] Error while fetching metadata with correlation id 7 : {test=LEADER_NOT_AVAILABLE}
[2018-06-13T22:40:31,398][WARN ][org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=producer-1] Received unknown topic or partition error in produce request on partition test-0. The topic/partition may not exist or the user may not have Describe access to it
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,402][INFO ][logstash.outputs.kafka   ] KafkaProducer.send() future failed, will retry {:exception=>org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
[2018-06-13T22:40:31,403][INFO ][logstash.outputs.kafka   ] Sending batch to Kafka failed. Will retry after a delay. {:batch_size=>1, :failures=>1, :sleep=>0.1}
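
The gist of the change, as a rough sketch rather than the exact diff (the method name is illustrative; logger and the batch/retry bookkeeping come from the surrounding plugin code, and the Kafka client classes are available on the JRuby classpath):

# Rough sketch only: decide per failed send future whether to retry or drop.
def handle_send_failure(exception)
  # KafkaProducer futures wrap the real error in a java.util.concurrent.ExecutionException.
  cause = exception.is_a?(java.util.concurrent.ExecutionException) ? exception.cause : exception
  if cause.is_a?(org.apache.kafka.common.errors.RetriableException)
    logger.info("KafkaProducer.send() future failed, will retry", :exception => cause)
    true   # keep the record in the batch for the next retry pass
  else
    logger.warn("KafkaProducer.send() future failed, dropping record", :exception => cause)
    false  # remove the record from the batch
  end
end

RecordTooLargeException does not extend RetriableException, so it ends up in the drop branch, while something like UnknownTopicOrPartitionException during a broker restart does and keeps the existing retry/backoff behaviour.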

@praseodym (Author)

The CI failure looks like a flake.

@robbavey (Contributor) left a comment

@praseodym Thanks for the contribution.

On the whole, I like the change: there is no reason to keep retrying messages that will never be sent successfully. I do have some concerns, though, about changing the default behavior to drop events over what could be a misconfiguration or miscalculation of max_request_size, for example. Without DLQ support, we don't really have a great story here.

cc @jsvd for his opinion

@praseodym (Author)

Agreed, having a DLQ would be the best solution for this. As the second-best option, in my opinion, I chose to log the entire Logstash event (the Kafka record value). This way it is at least possible to debug what kind of messages are being dropped.
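
Roughly what that drop path looks like (a sketch with illustrative names; cause is the non-retriable exception and record stands for the ProducerRecord built from the event, so record.value is the serialized payload):

# Sketch: without a DLQ, write the full serialized event to the log so
# dropped messages can at least be inspected after the fact.
logger.warn("KafkaProducer.send() future failed, dropping record",
            :exception => cause,
            :event => record.value)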

@praseodym (Author)

ping @robbavey @jsvd; any comments on this?

@xraystyle

Hey everyone, was there ever any movement on this? It seems I've hit this issue in production, and it becomes a show-stopping problem if enough of these messages build up and keep retrying.

@praseodym (Author)

This PR still works on the master branch. If there’s anything holding back a merge of this PR, please let me know.

@kreiger commented Nov 27, 2018

Hey, thanks for this PR; it unblocked our pipeline, which was stuck on RecordTooLargeExceptions. Looking forward to seeing it merged.

@praseodym force-pushed the retry-only-retriable branch from 8d25cde to da74af9 on January 6, 2019 14:00
@praseodym (Author)

@robbavey @jsvd can we get this merged? It has been over half a year and I'm not sure what we're waiting for.

@praseodym (Author)

After rebasing, I noticed that the CI failure was caused by an actual problem in the spec (not in the actual behaviour). This is fixed in a new commit.

I also noticed that retrying_send would sometimes retry the wrong message because it uses incorrect array indexing. This problem is present in current releases as well. It is fixed in another new commit in this PR.
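
A toy, self-contained illustration of that indexing problem (names and values are made up; in the plugin, futures[i] has to stay paired with batch[i]):

batch   = ["event-0", "event-1", "event-2"]
futures = [:ok, nil, :failed]   # nil where send() itself raised, :failed where the future failed

# Buggy: compacting shifts positions, so the failure at original index 2
# ends up paired with batch[1] instead of batch[2].
wrong = []
futures.compact.each_with_index do |future, i|
  wrong << batch[i] if future == :failed
end
# wrong == ["event-1"]  -- the wrong event would be retried

# Fixed: keep the original positions and skip the nil entries instead.
right = []
futures.each_with_index do |future, i|
  next if future.nil?
  right << batch[i] if future == :failed
end
# right == ["event-2"]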

@robbavey (Contributor) commented Feb 8, 2019

@jsvd Are you good to go with this after the most recent push?

@praseodym force-pushed the retry-only-retriable branch from 4714fd4 to d746376 on February 15, 2019 14:29
Nil values were removed from the futures array before looping, causing wrong indexes relative to the batch array.
@praseodym force-pushed the retry-only-retriable branch from cd63a07 to 2c72ced on March 9, 2019 22:00
@praseodym (Author)

Apparently I missed two specs, now fixed.

@Pectojin

Is there any reason this isn't getting merged?

I don't really know what to do with a plugin that will DoS itself to death... It seems pretty critical.

@praseodym (Author)

I'm not sure; maybe @robbavey or @jsvd have comments on this?

@praseodym (Author)

Closing in favour of logstash-plugins/logstash-integration-kafka#29.
