Nakadi clients resilience to partial outage and partial success #181

Open
adyach opened this issue Sep 12, 2023 · 5 comments

adyach commented Sep 12, 2023

The Nakadi publishing API accepts events in batches. It can fail to publish some events from a batch to the underlying storage (Apache Kafka). In that case, the publishing API returns an error indicating that the batch was only partially successful.
Depending on how the Nakadi client and the publishing application deal with this partial-success response, this can cause the following problems:

  • increased traffic on the Nakadi publishing API, because Nakadi clients retry the whole batch over and over
  • the application retries identical batches, which prevents it from making progress

The following should be done to reduce the likelihood of these problems:

  • The Nakadi client should contain a note to developers that publishing can result in partial success. This should be in the client documentation and ideally also in the in-code documentation (e.g. docstrings), to raise awareness among users.

  • An optional retry method can be provided at batch level for the whole batch, but the default strategy must include a backoff mechanism in case publishing to Nakadi keeps failing.

  • An optional retry method can be provided that re-publishes only the unsuccessful events to Nakadi (see the sketch after this list). This retry must also apply a backoff strategy by default.

  • Clients must expose the result of a publishing request in a way that makes developers aware of the possibility of partial success for batch publishing.
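
For illustration, a minimal sketch of the retry-only-failed-events idea with a default backoff. The `Publisher` and `EventResult` types below are hypothetical stand-ins for the per-item publishing status Nakadi returns on partial success; none of the names are taken from an existing client API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-event result, mirroring the per-item publishing status
// Nakadi reports on partial success ("submitted" vs. "failed"/"aborted").
record EventResult(String eid, boolean submitted) {}

// Hypothetical publishing call; a real client would wrap the HTTP request
// to POST /event-types/{name}/events here.
interface Publisher {
    List<EventResult> publishBatch(List<String> events) throws Exception;
}

final class PartialSuccessRetry {

    // Re-publishes only the events that were not submitted, with exponential
    // backoff between attempts, instead of re-sending the whole batch.
    // Assumes results come back in the same order as the submitted events.
    static void publishWithRetry(Publisher publisher, List<String> batch,
                                 int maxAttempts, long initialBackoffMillis)
            throws Exception {
        List<String> remaining = batch;
        long backoffMillis = initialBackoffMillis;

        for (int attempt = 1; attempt <= maxAttempts && !remaining.isEmpty(); attempt++) {
            List<EventResult> results = publisher.publishBatch(remaining);

            // Keep only the events whose submission did not succeed.
            List<String> stillFailing = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                if (!results.get(i).submitted()) {
                    stillFailing.add(remaining.get(i));
                }
            }
            remaining = stillFailing;

            if (!remaining.isEmpty() && attempt < maxAttempts) {
                Thread.sleep(backoffMillis);                         // back off before retrying
                backoffMillis = Math.min(backoffMillis * 2, 60_000); // cap the delay
            }
        }

        if (!remaining.isEmpty()) {
            throw new IllegalStateException(
                    remaining.size() + " events could not be published after " + maxAttempts + " attempts");
        }
    }
}
```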


ePaul commented Sep 12, 2023

Thanks for the issue (I was already starting to write my own, but you were faster).

Just to make it clear: Nakadi-Producer is not a Nakadi client by itself, but implements the outbox pattern (using Fahrschein for the submission), following the "eventual submission" paradigm.
After an application fires an event (i.e. it's stored in the eventlog table in the database), the library takes care of eventually sending it out.

Current behavior

  • Events for which the submission succeeded will be deleted from the eventlog table (and not retried).
  • All failed events will not be deleted from the eventlog table.
  • The failed events will stay locked for a configured time (lock-duration, 600 seconds (10 minutes) by default).
  • The next time the submission job runs (by default every second on each instance) after the lock time has expired, these events will be retried (and they will commonly be at the start of the queue, because they are the oldest).
  • There is no additional backoff mechanism (which would increase the delay or similar).

So short failures (< 10 minutes) of one partition will just repair themselves (and successful events will not be retried), but much longer failures will create a similar pattern of "retrying ever-growing batches of events which could not be submitted before, plus a few new ones".
With a configured lock-size (the default is currently "unlimited"), the batches will not grow without bound, but this in turn can mean that eventually no other events will be submitted: once there are so many failed events that the ones whose locks just expired already fill the lock-size, and they still fail, new events never get a turn.
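
For readers unfamiliar with the library, a compressed sketch of the cycle described above. The repository, sender and entity names are hypothetical stand-ins, not the actual Nakadi-Producer classes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical stand-ins for the eventlog table and the Nakadi submission;
// these are not the actual Nakadi-Producer classes.
record EventLogEntry(long id, String payload) {}

interface EventLogRepository {
    // Lock up to lockSize currently unlocked events until 'lockedUntil';
    // events that are still locked are skipped.
    List<EventLogEntry> lockSomeEvents(Instant lockedUntil, int lockSize);
    void delete(EventLogEntry entry);
}

interface EventSender {
    // Returns the entries that could not be submitted (partial success).
    List<EventLogEntry> send(List<EventLogEntry> entries);
}

final class SubmissionJob {
    private static final Duration LOCK_DURATION = Duration.ofSeconds(600); // default lock-duration

    private final EventLogRepository repository;
    private final EventSender sender;

    SubmissionJob(EventLogRepository repository, EventSender sender) {
        this.repository = repository;
        this.sender = sender;
    }

    // Runs periodically (by default every second on each instance).
    void run(int lockSize) {
        List<EventLogEntry> locked =
                repository.lockSomeEvents(Instant.now().plus(LOCK_DURATION), lockSize);
        List<EventLogEntry> failed = sender.send(locked);

        // Successful events are deleted from the eventlog; failed ones simply
        // stay locked until the lock expires and are then picked up again.
        for (EventLogEntry entry : locked) {
            if (!failed.contains(entry)) {
                repository.delete(entry);
            }
        }
    }
}
```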

Expected Behavior

I think we need these additional measures:

  • Events which failed multiple times are retried less often (backoff strategy)
  • Previously failed events are queued after events which didn't fail before. (Or between them, but not at the front.)
  • The documentation should be clearer.
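
For the first point, one possible shape of such a backoff, purely as a sketch: re-lock failed events for an exponentially growing duration instead of the fixed lock-duration. A per-event failure counter does not exist in the current schema and is an assumption here.

```java
import java.time.Duration;

// Sketch of an exponential backoff for re-locking failed events. The
// failureCount input assumes a per-event failure counter (not in the
// current schema).
final class RetryBackoff {
    private static final Duration BASE = Duration.ofMinutes(10); // current default lock-duration
    private static final Duration MAX = Duration.ofHours(6);     // arbitrary cap

    static Duration nextLockDuration(int failureCount) {
        long multiplier = 1L << Math.min(failureCount, 6); // 1, 2, 4, ... capped at 64
        Duration candidate = BASE.multipliedBy(multiplier);
        return candidate.compareTo(MAX) > 0 ? MAX : candidate;
    }
}
```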

There is also the related #172 about the handling of validation errors – there the behavior is even worse, as all the events in one batch will be blocked when some of them have validation failures.


adyach commented Sep 12, 2023

Great to hear you are aware of the problem and have ways to address it! Thank you for the details!

  • Previously failed events are queued after events which didn't fail before. (Or between them, but not at the front.)

Yes, you either change the queuing of the events (1) or the way you take them from the queue (2). I would prefer the second option if possible, because then you can mix them into a batch once they reach their retry time, or retry only the failed events. The implementation will probably also be easier (I do not know the details, though). In any case, it has to be done so that publishing to the available partitions is not blocked.

With a configured lock-size (the default is currently "unlimited"), the batches will not grow without bound, but this in turn can mean that eventually no other events will be submitted: once there are so many failed events that the ones whose locks just expired already fill the lock-size, and they still fail, new events never get a turn.

If my understanding is correct, that means after some time Nakadi will receive an unlimited batch of events. Could we change the default here to some reasonable value based on your experience?


ePaul commented Sep 12, 2023

Yes, you either change the queuing of the events (1) or the way you take them from the queue (2). I would prefer the second option if possible, because then you can mix them into a batch once they reach their retry time, or retry only the failed events. The implementation will probably also be easier (I do not know the details, though). In any case, it has to be done so that publishing to the available partitions is not blocked.

"Queue" here was more figurative – it's actually more like an unordered set, I now realize looking at the code. There is the table of all events which need sending out, and the scheduler will repeatedly lock some events (events which are already locked are skipped by this), then try to send these events out, then delete the successful ones.
The remaining ones stay locked, until their lock time expires (by default for 10 minutes).

The "lock some events" part actually (contrary to what I remembered) does not specify any ordering – I guess I mixed this up with some other service of one of my teams where this is implemented manually and has some ORDER BY eid ASC or similar.

I guess we could add something like ORDER BY COALESCE(locked_until, created) to the "lock" and "fetch locked" queries, so recently unlocked ones are just put in-between the new ones.
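
Roughly like this, purely as a sketch – the table and column names below are assumptions, not the actual schema:

```java
// Illustrative only: order lockable events by COALESCE(locked_until, created),
// so events whose lock just expired sort in between new events instead of
// always landing at the front of the next batch.
String lockQuery =
    "UPDATE event_log " +
    "   SET locked_by = ?, locked_until = ? " +
    " WHERE id IN (SELECT id FROM event_log " +
    "               WHERE locked_until IS NULL OR locked_until < now() " +
    "               ORDER BY COALESCE(locked_until, created) " +
    "               LIMIT ?)";
```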

If my understanding is correct, that means after some time Nakadi will receive an unlimited batch of events. Could we change the default here to some reasonable value based on your experience?

If the events are produced continuously, they will first be locked at different times, so at any given time only a small number of these locks will expire. Only those will be retried each time, so I would normally not expect a really large number.

Though if the producer is switched off for a while and then switched on again, it could send out such a huge batch at that point. We ran into this problem in the past (it led to an out-of-memory error in the producer), which is why we introduced this option at all.

Back then, we kept the default at "unlimited" to stay compatible with the previous behavior. Unfortunately, a reasonable limit here depends quite strongly on the size of the events, so it's difficult to define a generic default in the library. Instead, we planned to make the configuration option required with some major version bump.

I now created #182 to keep track of this.


adyach commented Oct 9, 2023

@ePaul any updates on the ticket?


ePaul commented Oct 12, 2023

Current status:

  • #182 (Lock-size: reasonable default or required?) about the default value of the lock-size parameter (or making it mandatory) has two options, with a PR draft for each. I got no feedback on any of it, then went on vacation and forgot about it.
  • I did not look into back-off strategies, as the default 10-minute one seemed enough for now.
  • I also didn't look into changing the ordering.
