feat(firehose sink): implement partial failure retry (draft! for discussion only at this point) #16703
jasongoodwin wants to merge 1 commit into vectordotdev:master
Conversation
```rust
if let Ok(resp) = &r {
    if resp.failed_put_count().unwrap_or(0) > 0 {
        #[cfg(not(test))] // the wait fails during test for some reason.
        sleep(Duration::from_millis(100)).await;
    }
}
```
There isn't any configuration or backoff at all. Really, just an example at this point to hopefully get some eyes/input.
```rust
for _ in 1..=3 {
    if let Ok(resp) = &r {
        if resp.failed_put_count().unwrap_or(0) > 0 {
            #[cfg(not(test))] // the wait fails during test for some reason.
            sleep(Duration::from_millis(100)).await;
        }
    }
}
```
Kind of odd, but I noticed that the tokio sleep causes a test failure, even if it isn't executed.
```rust
for (rec, response) in itr {
    // TODO: could just filter instead of pushing manually
    if response.error_code().is_some() {
        failed_records.push(rec.clone());
    }
}
```
all kinds of cloning and unidiomatic rust - this is just an example for now.
Hey @jasongoodwin, first off - thanks for opening this PR for discussion and spiking out the implementation. @bruceg and I looked it over today, and while we like the direction of this enhancement, we think there are some problems with it that would currently block us from moving forward. We think this partial retry logic would need to be hoisted up into the

This is likely to be a larger change, and possibly require an RFC before implementing - if you're interested in pushing that forward we'd of course be happy to guide the contribution, but we definitely understand if that's too much.

We do think the second PR you opened is a step in the right direction, as it re-uses an existing behavior from another sink - as well as being replaceable once a more fine-grained retry mechanism is in place. I'll be reviewing the second PR shortly, and we should be able to get it merged without too much fuss.
Okay great - thanks! I'm going to close this one - I was looking for feedback mostly, and I know this isn't the right way :)
I'll also try and find some time to discuss with @bruceg and see if we can get some initial pointers/direction. |
DON'T MERGE - This PR is absolutely the most minimal thing I could do to get a discussion going.
Been trying to get some discussion going on this on discord etc but haven't caught anyone yet.
Description:
Currently, partial failures are retried in their entirety in ES, but in Kinesis they are ignored and the data is lost.
I have a change I'm working on to contribute an ES-like feature - to retry the whole batch - but I'm worried something might get stuck and fail infinitely.
I was wondering how you might suggest I contribute a feature to actually retry only the failed records, or whether you think a complete retry like in ES is the right approach, with downstream sinks de-duplicating anything coming from Firehose.
Retrying the entire batch is likely okay for Kinesis Streams, which will de-dup, but it will cause duplication in Kinesis Firehose.
This may be okay, as it's already at-least-once delivery semantics, and there isn't any way to deduplicate.
I'll likely work on both solutions but trying to get some feedback! I'm on discord in the AWS channel if you want to talk, or can discuss here.
Some more detailed notes below.
Overview
While acknowledgements have been implemented for the Vector firehose sink, there are still a couple of issues with Vector:
#16578
#12835
Because Vector will drop data, it needs a mechanism for either fully retrying partial failures (this is what the ES sink does in Vector), or retrying only the subset of messages that failed. This change focuses on implementing a safe retry of only the partial failures, along with logging/metrics for any data that can’t reach Firehose, so data loss can be monitored once running at scale.
In-Depth Design
Firehose can return a 200 response that still contains per-record failures, but the acknowledgement is all-or-nothing in Vector. This change describes how to accomplish partial retries.
Kinesis Firehose Delivery Semantics
Kinesis Firehose provides at-least-once delivery semantics, so some degree of duplicate data will be published to it. To achieve exactly-once semantics, the messages need to contain something like a unique event ID, and there needs to be a mechanism for deduplication. As an example, indexing on the unique event ID in a sink such as Elasticsearch would deduplicate any events.
(Note that a sink such as Kinesis Streams does have exactly-once semantics, by deduplicating events on an ID.)
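As an illustration of the deduplication idea, a unique event ID can be used as the Elasticsearch document ID so duplicate deliveries overwrite rather than duplicate. This is only a sketch: the sink name, inputs, and endpoint are placeholders, and it assumes events carry an `event_id` field.

```toml
# Sketch only - names and endpoint are placeholders, not from the PR.
[sinks.dedup_es]
type     = "elasticsearch"
inputs   = ["my_source"]
endpoint = "http://localhost:9200"
id_key   = "event_id"   # assumes each event carries a unique `event_id` field
```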
Proposed Solution 1: Complete Retry for Partial Failure
The way Vector is designed doesn’t allow any way of retrying only pieces of a batch request, and the current approach in other sinks is to retry a partially failed operation in its entirety (this is how the Elasticsearch sink implements retry).
https://vector.dev/docs/reference/configuration/sinks/elasticsearch/#request_retry_partial
This approach would work for Firehose, assuming we handle deduplication downstream, but the risk is that a request may never succeed, halting processing or causing other pathological effects.
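For comparison, the Elasticsearch behavior referenced above is opt-in via sink configuration. A sketch (sink name, inputs, and endpoint are placeholders):

```toml
# Sketch of opting into whole-batch retry on partial failure for the
# Elasticsearch sink; names and endpoint are illustrative placeholders.
[sinks.my_elasticsearch]
type                  = "elasticsearch"
inputs                = ["my_source"]
endpoint              = "http://localhost:9200"
request_retry_partial = true
```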
Proposed Solution 2: Handle Partial Failure Retry Inside the Sink
Without a redesign of Vector, another solution is to handle partial failures inside the sink. To mitigate any risk of infinite failure, this is likely safer than trying to use the existing retry semantics.
The firehose sink can have its own partial-failure retry configuration (e.g. a bounded number of retries) and manage re-attempting the records that fail in a batch a few times, before logging any records it can’t deliver, and acknowledging.
The Firehose response contains a list of per-record responses ordered to match the request, so it’s easy to zip and iterate to collect any failures.
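The zip-and-collect step described above can be sketched as follows. The types are illustrative stand-ins, not Vector's or the AWS SDK's real types; the only assumption taken from the Firehose API is that the response lists one entry per record, in request order.

```rust
// Hypothetical sketch: pair each sent record with its per-record response
// entry (returned in request order) and collect the records that failed,
// so only that subset is re-attempted.
#[derive(Clone, Debug, PartialEq)]
struct Record(Vec<u8>);

struct ResponseEntry {
    error_code: Option<String>,
}

fn collect_failures(records: &[Record], entries: &[ResponseEntry]) -> Vec<Record> {
    records
        .iter()
        .zip(entries.iter())
        .filter(|(_, entry)| entry.error_code.is_some())
        .map(|(rec, _)| rec.clone())
        .collect()
}
```

The cloning mirrors the draft's current approach; a real implementation might collect indices instead to avoid copying record payloads.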
Performance Consideration
If the retries block the stream, the delay needs to be considered.
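One way to bound that delay is a capped exponential backoff, so a persistently failing batch slows the stream but never stalls it indefinitely. A minimal sketch; the base delay and ceiling are illustrative assumptions, not values from the PR:

```rust
use std::time::Duration;

// Capped exponential backoff for partial-failure retries.
// Assumed values, not Vector defaults: 100ms base, 5s ceiling.
fn backoff(attempt: u32) -> Duration {
    const BASE_MS: u64 = 100;
    const CAP_MS: u64 = 5_000;
    // Double the delay each attempt, clamping the shift to avoid overflow.
    let delay = BASE_MS.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(delay.min(CAP_MS))
}
```

Combined with a bounded attempt count, the worst-case added latency per batch is the sum of a few capped delays, which can be budgeted against the sink's batch timeout.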