
feat(firehose sink): implement partial failure retry (draft! for discussion only at this point)#16703

Closed
jasongoodwin wants to merge 1 commit into vectordotdev:master from jasongoodwin:firehose_partials

Conversation

@jasongoodwin
Contributor

@jasongoodwin jasongoodwin commented Mar 7, 2023

DON'T MERGE - This PR is absolutely the most minimal thing I could do to get a discussion going.

I've been trying to start a discussion about this on Discord etc. but haven't caught anyone yet.

Description:
Currently, partial failures are retried in their entirety in the ES sink, but in the Kinesis sinks they are ignored and the data is lost.
I have a change in progress to contribute a feature like the ES sink's (retrying the whole batch), but I'm worried something might get stuck and fail infinitely.
I was wondering how you might suggest I contribute a feature to retry only the failed records, or whether you think a complete retry like in ES is the right approach, with anything sent to Firehose deduplicated downstream.

Retrying the entire batch is likely okay for Kinesis Streams, which will de-dup, but it will cause duplication in Kinesis Firehose.
This may be okay, as Firehose already provides at-least-once delivery semantics and there isn't any way to deduplicate within it.

I'll likely work on both solutions but trying to get some feedback! I'm on discord in the AWS channel if you want to talk, or can discuss here.

Some more detailed notes below.


Overview

While acknowledgements have been implemented for the Vector firehose sink, there are still a couple of open issues in Vector:
#16578
#12835

Because Vector currently drops this data, it needs a mechanism for either fully retrying partial failures (what the ES sink does in Vector) or retrying only the subset of records that failed. This change focuses on implementing a safe retry of only the partial failures, along with logging/metrics for any data that can't reach Firehose, so data loss can be monitored once running at scale.
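As a sketch of the "retry only the failed subset" idea (the types and names here are illustrative stand-ins, not the actual Vector sink code), the loop below re-sends only the records whose per-record result carried an error, up to a fixed number of attempts, and returns whatever still fails so it can be logged:

```rust
// Sketch: retry only the failed subset of a batch, up to `max_attempts`.
// Records are modeled as plain u32s, and `send` returns one optional
// error code per record, order-aligned with the input (hypothetical API).
fn retry_partial<F>(mut batch: Vec<u32>, mut send: F, max_attempts: u32) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<Option<&'static str>>,
{
    for _ in 0..max_attempts {
        if batch.is_empty() {
            break;
        }
        let outcomes = send(&batch);
        // Keep only the records whose result carried an error code.
        batch = batch
            .into_iter()
            .zip(outcomes)
            .filter(|(_, err)| err.is_some())
            .map(|(rec, _)| rec)
            .collect();
    }
    batch // whatever remains still failed after max_attempts
}
```

Any records remaining in the returned vector would be logged and counted (so the data loss is observable) before the batch is acknowledged.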

In Depth Design

Firehose can return a 200 response even when some records in the batch fail, but acknowledgement in Vector is all or nothing. This change describes how to accomplish partial retries.

Kinesis Firehose Delivery Semantics

Kinesis Firehose provides at-least-once delivery semantics, so some degree of duplicate data will be published to it. To achieve exactly-once semantics, the messages need to contain something like a unique event ID, and there needs to be a mechanism for deduplication. As an example, indexing on the unique event ID in a sink such as Elasticsearch would deduplicate any events.

(Note that other sinks such as Kinesis Streams do have exactly-once semantics by deduplicating events on an ID.)
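For illustration, deduplicating a batch on a unique event ID might look like the following; the `Event` type and `dedup_by_id` helper are hypothetical, standing in for whatever a downstream sink (e.g. Elasticsearch indexing on the ID) would effectively do:

```rust
use std::collections::HashSet;

// Hypothetical event carrying a unique ID; stands in for a Vector event.
#[derive(Clone)]
struct Event {
    id: String,
    payload: String,
}

// Deduplicate a batch by event ID, keeping the first occurrence.
// A sink that indexes on the ID achieves the same effect by
// overwriting documents that share an ID.
fn dedup_by_id(events: Vec<Event>) -> Vec<Event> {
    let mut seen = HashSet::new();
    events
        .into_iter()
        .filter(|e| seen.insert(e.id.clone()))
        .collect()
}
```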

Proposed Solution 1: Complete Retry for Partial Failure

The way Vector is designed doesn't allow retrying only part of a batch request; the current approach in other sinks is to retry a partially failed request in its entirety (this is how the Elasticsearch sink implements retry):

https://vector.dev/docs/reference/configuration/sinks/elasticsearch/#request_retry_partial

This approach would work for firehose, assuming we handle deduplication downstream, but the unknown risk is that a request may never succeed, halting processing or causing other pathological effects.

Proposed Solution 2: Handle Partial Failure Retry Inside the Sink

Without a redesign of Vector, another option is to handle partial failures inside the sink. To mitigate any risk of infinite failure, this is likely safer than relying on the existing retry semantics.

The firehose sink can have its own partial-failure retry configuration (e.g. a finite retry limit) and re-attempt the records that fail in a batch a few times, before logging any records it can't deliver and acknowledging.

The firehose response contains a list of per-record results ordered to match the request, so it's easy to zip and iterate to collect any failures.
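A minimal sketch of that zip-and-collect step, using stand-in types rather than the real `aws_sdk_firehose` response types:

```rust
// Stand-ins for the SDK's record and per-record response entry types;
// the names are illustrative, not the actual aws_sdk_firehose API.
#[derive(Clone, Debug, PartialEq)]
struct Record {
    data: Vec<u8>,
}

struct ResponseEntry {
    error_code: Option<String>,
}

// Collect the records whose paired response entry carries an error code.
// The PutRecordBatch response preserves request order, so a positional
// zip pairs each record with its outcome.
fn failed_records(records: &[Record], entries: &[ResponseEntry]) -> Vec<Record> {
    records
        .iter()
        .zip(entries.iter())
        .filter(|(_, e)| e.error_code.is_some())
        .map(|(r, _)| r.clone())
        .collect()
}
```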

Performance Consideration

If the retries block the stream, the added delay needs to be considered.
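One way to keep that delay bounded and predictable is a capped exponential backoff between retry attempts; the base and cap values below are illustrative, not Vector defaults:

```rust
use std::time::Duration;

// Capped exponential backoff: the delay doubles per attempt but never
// exceeds `cap_ms`, so a batch with persistent partial failures stalls
// the stream by a bounded amount. Values are illustrative only.
fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 100;
    let cap_ms: u64 = 2_000;
    // Clamp the shift so large attempt counts can't overflow.
    Duration::from_millis((base_ms << attempt.min(10)).min(cap_ms))
}
```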

@jasongoodwin jasongoodwin requested a review from a team March 7, 2023 02:14
@bits-bot

bits-bot commented Mar 7, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Jason Goodwin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@netlify

netlify bot commented Mar 7, 2023

Deploy Preview for vector-project failed.

Name Link
🔨 Latest commit 622743e
🔍 Latest deploy log https://app.netlify.com/sites/vector-project/deploys/64069def9f8eb60008a147a9

@netlify

netlify bot commented Mar 7, 2023

Deploy Preview for vrl-playground ready!

Name Link
🔨 Latest commit 622743e
🔍 Latest deploy log https://app.netlify.com/sites/vrl-playground/deploys/64069defdbfdc70008f817b0
😎 Deploy Preview https://deploy-preview-16703--vrl-playground.netlify.app

@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Mar 7, 2023
if let Ok(resp) = &r {
    if resp.failed_put_count().unwrap_or(0) > 0 {
        #[cfg(not(test))] // the wait fails during test for some reason.
        sleep(Duration::from_millis(100)).await;
Contributor Author

@jasongoodwin jasongoodwin Mar 7, 2023


There isn't any configuration or backoff at all. Really, just an example at this point to hopefully get some eyes/input.

@jasongoodwin jasongoodwin changed the title feat(firehose sink): implement naive partial retry (Don't merge! for discussion at this point) feat(firehose sink): implement partial failure retry (Don't merge! for discussion at this point) Mar 7, 2023
for _ in 1..=3 {
    if let Ok(resp) = &r {
        if resp.failed_put_count().unwrap_or(0) > 0 {
            #[cfg(not(test))] // the wait fails during test for some reason.
Contributor Author

@jasongoodwin jasongoodwin Mar 7, 2023


Kind of odd, but I noticed that the tokio sleep causes a test failure, even if it isn't executed.

for (rec, response) in itr {
    // TODO can just filter
    if response.error_code().is_some() {
        failed_records.push(rec.clone());
Contributor Author


all kinds of cloning and unidiomatic rust - this is just an example for now.

@jasongoodwin jasongoodwin changed the title feat(firehose sink): implement partial failure retry (Don't merge! for discussion at this point) feat(firehose sink): implement partial failure retry (draft! for discussion only at this point) Mar 7, 2023
@fuchsnj fuchsnj added the domain: reliability Anything related to Vector's reliability label Mar 7, 2023
@spencergilbert spencergilbert requested a review from a team March 8, 2023 16:20
@spencergilbert spencergilbert self-assigned this Mar 8, 2023
@spencergilbert
Contributor

Hey @jasongoodwin, first off - thanks for opening this PR for discussion and spiking out the implementation.

@bruceg and I looked it over today and while we like the direction of this enhancement we think there are some problems with it that would currently block us from moving forward.

We think this partial retry logic would need to be hoisted up into the Driver rather than living in the sink here. That should allow it to interact with the Adaptive Request Concurrency feature, rather than causing throttling when retrying these partial failures.

This is likely to be a larger change, and may require an RFC before implementing. If you're interested in pushing that forward we'd of course be happy to guide the contribution, but we definitely understand if that's too much.

We do think the second PR you opened is a step in the right direction, as it re-uses an existing behavior from another sink - as well as being replaceable once a more fine grained retry mechanism is in place. I'll be reviewing the second PR shortly, and we should be able to get it merged without too much fuss.

@jasongoodwin
Contributor Author

jasongoodwin commented Mar 15, 2023

Okay great, thanks! I'm going to close this one; I was mostly looking for feedback and I know this isn't the right way :)
So the feedback here is that this belongs in the Driver. I'm interested in picking this up as a weekend project, so I'll try to understand some of those details a bit better.

@spencergilbert
Contributor

I'll also try and find some time to discuss with @bruceg and see if we can get some initial pointers/direction.
