feat(firehose sink): implement partial failure retry (draft! for discussion only at this point) #16703
jasongoodwin wants to merge 1 commit into vectordotdev:master
Conversation
```rust
if let Ok(resp) = &r {
    if resp.failed_put_count().unwrap_or(0) > 0 {
        #[cfg(not(test))] // the wait fails during test for some reason.
        sleep(Duration::from_millis(100)).await;
    }
}
```
There isn't any configuration or backoff at all. Really, just an example at this point to hopefully get some eyes/input.
```rust
for _ in 1..=3 {
    if let Ok(resp) = &r {
        if resp.failed_put_count().unwrap_or(0) > 0 {
            #[cfg(not(test))] // the wait fails during test for some reason.
            sleep(Duration::from_millis(100)).await;
        }
    }
}
```
Kind of odd, but I noticed that the tokio sleep causes a test failure, even if it isn't executed.
```rust
for (rec, response) in itr {
    // TODO: could just filter instead of pushing manually
    if response.error_code().is_some() {
        failed_records.push(rec.clone());
    }
}
```
all kinds of cloning and unidiomatic rust - this is just an example for now.
Hey @jasongoodwin, first off - thanks for opening this PR for discussion and spiking out the implementation. @bruceg and I looked it over today, and while we like the direction of this enhancement, we think there are some problems with it that would currently block us from moving forward. We think this partial retry logic would need to be hoisted up into the

This is likely to be a larger change, and possibly require an RFC before implementing - if you're interested in pushing that forward we'd of course be happy to guide the contribution, but we definitely understand if that's too much.

We do think the second PR you opened is a step in the right direction, as it re-uses an existing behavior from another sink - as well as being replaceable once a more fine-grained retry mechanism is in place. I'll be reviewing the second PR shortly, and we should be able to get it merged without too much fuss.
Okay great - thanks! I'm going to close this one - I was looking for feedback mostly, and I know this isn't the right way :)
I'll also try and find some time to discuss with @bruceg and see if we can get some initial pointers/direction. |
DON'T MERGE - This PR is absolutely the most minimal thing I could do to get a discussion going.
Been trying to get some discussion going on this on discord etc but haven't caught anyone yet.
Description:
Currently, partial failures are retried in their entirety in ES, but in Kinesis they are ignored and the data is lost.
I have a change I'm working on to contribute an ES-like feature - to retry the whole batch - but I'm worried something might get stuck and fail infinitely.
I was wondering how you might suggest I contribute a feature to actually retry only the failed records, or whether you think a complete retry like in ES is the right approach, with downstream sinks de-duplicating anything coming from Firehose.
Retrying the entire batch is likely okay for Kinesis Streams, which will de-dup, but it will cause duplication in Kinesis Firehose.
This may be okay, as it's already at-least-once delivery semantics, and there isn't any way to deduplicate.
I'll likely work on both solutions but trying to get some feedback! I'm on discord in the AWS channel if you want to talk, or can discuss here.
Some more detailed notes below.
Overview
While acknowledgements have been implemented for the Vector firehose sink, there are still a couple of issues with Vector:
#16578
#12835
Because Vector will drop data, it needs a mechanism for either fully retrying partial failures (this is what the ES sink does in Vector), or retrying only the subset of messages that failed. This change focuses on implementing a safe retry of only the partial failures, along with logging/metrics for any data that can’t reach Firehose, so data loss can be monitored once running at scale.
In-Depth Design
Firehose can return a 200 response that still contains per-record failures, but the acknowledgement is all-or-nothing in Vector. This change describes how to accomplish partial retries.
Kinesis Firehose Delivery Semantics
Kinesis Firehose provides at-least-once delivery semantics, so some degree of duplicate data will be published to it. To achieve exactly-once semantics, the messages need to contain something like a unique event ID, and there needs to be a mechanism for deduplication. As an example, indexing on the unique event ID in a sink such as Elasticsearch would deduplicate any events.
(Note that a sink such as Kinesis Streams does have exactly-once semantics, by deduplicating events on an ID.)
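As an illustration of the deduplication idea, a unique event ID can be used as the Elasticsearch document ID so duplicate deliveries overwrite rather than duplicate. This is only a sketch: the sink name, inputs, and endpoint are placeholders, and it assumes events carry an `event_id` field.

```toml
# Sketch only - names and endpoint are placeholders, not from the PR.
[sinks.dedup_es]
type     = "elasticsearch"
inputs   = ["my_source"]
endpoint = "http://localhost:9200"
id_key   = "event_id"   # assumes each event carries a unique `event_id` field
```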
Proposed Solution 1: Complete Retry for Partial Failure
The way Vector is designed doesn’t allow any way of retrying only pieces of a batch request, and the current approach in other sinks is to retry a partially failed operation in its entirety (this is how the Elasticsearch sink implements retry).
https://vector.dev/docs/reference/configuration/sinks/elasticsearch/#request_retry_partial
This approach would work for Firehose, assuming we handle deduplication downstream, but the risk is that a request may never succeed, halting processing or causing other pathological effects.
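For comparison, the Elasticsearch behavior referenced above is opt-in via sink configuration. A sketch (sink name, inputs, and endpoint are placeholders):

```toml
# Sketch of opting into whole-batch retry on partial failure for the
# Elasticsearch sink; names and endpoint are illustrative placeholders.
[sinks.my_elasticsearch]
type                  = "elasticsearch"
inputs                = ["my_source"]
endpoint              = "http://localhost:9200"
request_retry_partial = true
```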
Proposed Solution 2: Handle Partial Failure Retry Inside the Sink
Without a redesign of Vector, another solution is to handle partial failures inside the sink. To mitigate any risk of infinite failure, this is likely safer than trying to use the existing retry semantics.
The firehose sink can have its own partial-failure retry configuration (e.g. a bounded number of retries) and manage re-attempting the records that fail in a batch a few times, before logging any records it can’t deliver, and acknowledging.
The Firehose response contains a list of per-record responses ordered to match the request, so it’s easy to zip and iterate to collect any failures.
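The zip-and-collect step described above can be sketched as follows. The types are illustrative stand-ins, not Vector's or the AWS SDK's real types; the only assumption taken from the Firehose API is that the response lists one entry per record, in request order.

```rust
// Hypothetical sketch: pair each sent record with its per-record response
// entry (returned in request order) and collect the records that failed,
// so only that subset is re-attempted.
#[derive(Clone, Debug, PartialEq)]
struct Record(Vec<u8>);

struct ResponseEntry {
    error_code: Option<String>,
}

fn collect_failures(records: &[Record], entries: &[ResponseEntry]) -> Vec<Record> {
    records
        .iter()
        .zip(entries.iter())
        .filter(|(_, entry)| entry.error_code.is_some())
        .map(|(rec, _)| rec.clone())
        .collect()
}
```

The cloning mirrors the draft's current approach; a real implementation might collect indices instead to avoid copying record payloads.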
Performance Consideration
If the retries block the stream, the delay needs to be considered.
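One way to bound that delay is a capped exponential backoff, so a persistently failing batch slows the stream but never stalls it indefinitely. A minimal sketch; the base delay and ceiling are illustrative assumptions, not values from the PR:

```rust
use std::time::Duration;

// Capped exponential backoff for partial-failure retries.
// Assumed values, not Vector defaults: 100ms base, 5s ceiling.
fn backoff(attempt: u32) -> Duration {
    const BASE_MS: u64 = 100;
    const CAP_MS: u64 = 5_000;
    // Double the delay each attempt, clamping the shift to avoid overflow.
    let delay = BASE_MS.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(delay.min(CAP_MS))
}
```

Combined with a bounded attempt count, the worst-case added latency per batch is the sum of a few capped delays, which can be budgeted against the sink's batch timeout.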