Failure tolerance #17

Open

komex opened this issue May 27, 2016 · 1 comment

Comments


komex commented May 27, 2016

I have a little question about how to guarantee message delivery to Kafka. If there is a large stream of important messages, how can we be sure of delivery?
For example, we run Bruce in Docker and write a message to its socket. Bruce reads it and temporarily stores it in memory for batched sending. What happens if the Docker container crashes (is killed) at that moment? Will the messages in memory be lost? If so, how can we guarantee message delivery?

dspeterson (Contributor) commented May 27, 2016

If the container that Bruce is running in crashes, you will lose any messages that Bruce has stored internally and not yet sent. Additionally, you may lose messages that Bruce has already sent but is holding until it receives an ACK from Kafka indicating successful delivery, since it's always possible that Kafka will return an error ACK, requiring Bruce to resend.

A certain amount of this type of potential message loss is unavoidable, given that host machines and/or Docker containers cannot be made perfectly reliable. However, it would be nice to have an easy way to determine that at a certain point in time, all messages with timestamps <= T have been either delivered successfully to Kafka or discarded by Bruce (and reported as discards through Bruce's web interface). I have plans to add functionality to Bruce that will facilitate making this determination. Let's refer to T as the commit point. One can imagine Bruce keeping track of a global commit point for all messages, or perhaps a separate commit point for each Kafka topic, and reporting this information through its web interface. I haven't yet decided which of these (or perhaps both) to provide.
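To make the per-topic idea concrete, here is a minimal sketch of how a client might poll commit points. The `/commit_points` endpoint, the port, and the JSON response format are all assumptions invented for illustration; none of this exists in Bruce today.

```python
import json
import urllib.request

# Hypothetical endpoint on Bruce's status web interface. The feature is
# only proposed above, so the URL and response format are assumptions.
COMMIT_POINTS_URL = "http://localhost:9090/commit_points"

def get_commit_points():
    """Fetch assumed per-topic commit points as {topic: epoch_millis}."""
    with urllib.request.urlopen(COMMIT_POINTS_URL) as resp:
        return json.load(resp)

def committed_through(points, timestamp_ms):
    """True if every topic's commit point has reached timestamp_ms, i.e.
    every message with a timestamp <= timestamp_ms has either been
    delivered to Kafka or reported as a discard."""
    return all(t >= timestamp_ms for t in points.values())
```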

Once this is available, clients can send periodic heartbeat messages through Bruce, with each heartbeat containing the latest commit point information. Then when a downstream consumer sees a heartbeat referencing commit point T in a message stream it is consuming, it knows that at that point in the stream, it has seen all messages it will ever see with timestamps <= T. To facilitate sending heartbeats, support can be added for sending a broadcast or multicast message through Bruce (to all topics, or a client-provided list of topics).
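Sketching the consumer side of this, and assuming an invented heartbeat payload of the form `HEARTBEAT <epoch_millis>` (Bruce defines no such format today):

```python
def handle_message(msg_value, on_commit_point, process):
    """Dispatch one consumed message; heartbeats advance the commit point.

    The b"HEARTBEAT <epoch_millis>" payload is a made-up convention for
    this sketch, not anything Bruce currently produces.
    """
    if msg_value.startswith(b"HEARTBEAT "):
        commit_point = int(msg_value.split()[1])
        # At this point in the stream, the consumer has seen every message
        # it will ever see with a timestamp <= commit_point.
        on_commit_point(commit_point)
    else:
        process(msg_value)  # ordinary application message
```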

It's possible to redesign Bruce's discard reporting web interface around the concept of commit points. Specifically, when asked for a discard report, Bruce can provide counts of messages discarded for various reasons, and successfully delivered, up to the current commit point. Then suppose we have two discard reports, R1 and R2, such that R1 references commit point T1, and R2 references commit point T2 > T1. To see how many messages were discarded for a given reason between T1 and T2, we can subtract the appropriate discard count in R1 from the corresponding discard count in R2.

Monitoring infrastructure can then choose how frequently to ask Bruce for discard reports, and thus determine the granularity of the obtained data quality information. To ensure that discards are visible without unnecessary delay, a discard report can also contain counts of messages discarded for various reasons whose timestamps are greater than the current commit point. For backward compatibility, Bruce can continue to support its current discard reporting mechanism as well as the one described here.
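The subtraction described above amounts to differencing two cumulative reports; here is a small sketch, with the report structure and discard reason names assumed for illustration rather than taken from Bruce's actual discard report format:

```python
def discards_between(r1, r2):
    """Given assumed discard reports r1 (at commit point T1) and r2 (at
    commit point T2 > T1), each mapping discard reason -> cumulative count
    up to its commit point, return per-reason discard counts in (T1, T2]."""
    return {reason: count - r1.get(reason, 0) for reason, count in r2.items()}

# Example: 3 "kafka_error" and 2 "rate_limit" discards occurred between T1 and T2.
r1 = {"kafka_error": 10, "rate_limit": 0}
r2 = {"kafka_error": 13, "rate_limit": 2}
assert discards_between(r1, r2) == {"kafka_error": 3, "rate_limit": 2}
```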

Look for these features or something similar in a future version of Bruce. Although message loss of the type you describe cannot be totally prevented, the above-mentioned functionality would allow one to determine with more certainty which parts of a message stream are known to be intact, and which parts may suffer from message loss. For instance, after a Docker container crash it may be useful to know that any resulting message loss is limited to messages with timestamps > T. Let me know if you have other ideas for improvements.
