[Question] Partial Retries Support #1911
Comments
Sorry for the delayed response. What does 'bad' mean? I assume bad entries can't be flushed to the destination, right? For bad chunks, we will add a backup feature to avoid useless retries: #1856
The … Seems like we can use …
Currently, we have a default retry mechanism for it. Hmm... I didn't understand the situation correctly.
I assume your problem is …
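For reference, the built-in retry mechanism mentioned above is configured per buffered output. A minimal sketch with illustrative values, using the standard v1 buffer parameters:

```
<match example.**>
  @type google_cloud                 # or any buffered output plugin
  <buffer>
    flush_interval 5s
    retry_type exponential_backoff   # default strategy
    retry_wait 1s                    # wait before the first retry
    retry_max_interval 60s           # cap on the backoff interval
    retry_timeout 72h                # give up on the chunk after this long
  </buffer>
</match>
```

Note that these retries always operate on a whole buffer chunk; there is no per-record granularity, which is the limitation discussed in this thread.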
@repeatedly - Yes! Our problem is …
I see... I understood the situation.
@repeatedly - We have a local binary running that provides metadata for log entries based on a label. Our Fluentd output plugin makes one request per log entry if the metadata for that label is not cached locally. Depending on whether the metadata is available, we might choose to retry a subset of log entries. The reason behind this: …
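To make that flow concrete, a rough Ruby sketch of the lookup-and-partition pattern just described; all names (MetadataResolver, fetch_metadata, the 'label' field) are illustrative, not taken from the actual plugin:

```ruby
# Hypothetical sketch of the lookup-and-partition pattern; all names are
# illustrative and the metadata service call is stubbed out.
class MetadataResolver
  def initialize
    @metadata_cache = {}   # label => metadata
  end

  # Cached labels are served locally; uncached labels cost one request each.
  # A nil result is not cached, so the entry can be retried later.
  def metadata_for(record)
    label = record['label']
    @metadata_cache[label] ||= fetch_metadata(label)
  end

  # Split a batch into entries that can be flushed now ("good") and entries
  # whose metadata is still missing and should be retried later ("bad").
  def partition(entries)
    entries.partition { |_time, record| metadata_for(record) }
  end

  private

  # Stand-in for the real call to the local metadata binary.
  def fetch_metadata(_label)
    nil
  end
end
```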
@portante and his co-workers reported a similar issue. Below is his explanation:
ref: uken/fluent-plugin-elasticsearch#398 (comment). Originally reported in …
@qingling128 I see. Does your record have a specific field for your metadata? @cosmo0920 I saw the patch. Why …
According to Red Hat's Bugzilla, it seems that there is a performance issue.
@repeatedly @cosmo0920 As I stated in the original issue, there was not necessarily a performance issue but a problem with the way chunks were consumed by fluent-plugin-elasticsearch. If any part of deserializing and processing records in the chunk failed, an exception would bubble up and the chunk would be retried indefinitely; there was no way to get past bad processing. Do we have an example of using the router with the error label that would negate the need for #1948 so I could close it?
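A minimal sketch of what such an example might look like, assuming a v1 buffered output with `tag` configured as a chunk key; `send_to_destination` is a hypothetical stand-in for the real per-record delivery. Records emitted with `router.emit_error_event` are routed to the built-in `@ERROR` label instead of failing the whole chunk:

```ruby
require 'fluent/plugin/output'

module Fluent
  module Plugin
    # Illustrative plugin, not an existing one.
    class PartialRetryExampleOutput < Output
      Fluent::Plugin.register_output('partial_retry_example', self)
      helpers :event_emitter   # provides `router` inside an output plugin

      def write(chunk)
        tag = chunk.metadata.tag   # assumes 'tag' is a chunk key
        chunk.msgpack_each do |time, record|
          begin
            send_to_destination(record)
          rescue => e
            # Route only the failing record to <label @ERROR> instead of
            # raising, so the rest of the chunk is not retried.
            router.emit_error_event(tag, time, record, e)
          end
        end
      end

      private

      # Hypothetical stand-in for the real delivery call.
      def send_to_destination(record)
        record
      end
    end
  end
end
```

A `<label @ERROR>` section in the configuration can then buffer, retag, or re-route the records emitted this way.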
@repeatedly - Yes, our records do have a specific field … The problem with deciding the chunk by …: in a container world, this means that if a Fluentd instance is handling logs from 50 containers, the number of requests we make to the Logging API will be 50 times the original request number. Previously, we could bundle log entries from 50 containers into one Logging API request. Now we have to make 50 requests (from 50 flushes). This will significantly slow us down.
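To illustrate the trade-off: chunking by a record field means one buffer chunk, and therefore one flush and one Logging API request, per distinct value of that field. A hedged config sketch with a hypothetical `container_label` chunk key:

```
<match containers.**>
  @type google_cloud
  # 'container_label' is a hypothetical record field used as a chunk key:
  # entries are grouped into a separate chunk per label value.
  <buffer container_label>
    flush_interval 5s
  </buffer>
</match>
```

With 50 containers (50 distinct label values), each flush interval now produces 50 chunks and 50 API requests instead of one bundled request.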
Hi, any news about this feature request?
Hi,
Any chance there is a way to do partial retries in Fluentd? Right now we maintain an output plugin, https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud. For each buffer chunk the plugin processes, we might have some "good" entries and some "bad" entries. We'd like to enable users to flush the "good" entries right away but retry the "bad" entries later.
In the current implementation, it seems that an output plugin can only raise an exception to indicate that the chunk was not processed and should be retried. The problem is that we either block the "good" entries as well, or we retry the "good" entries again and generate duplicate logs.
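For concreteness, a minimal Ruby sketch of this all-or-nothing behaviour; `client` and its `write_entries` / `partial_failure?` response are hypothetical:

```ruby
# Inside a buffered output plugin's write method. If any entry in the chunk
# is rejected, the only built-in signal is to raise, which retries the WHOLE
# chunk: the "good" entries are resent (duplicates) along with the "bad" ones.
def write(chunk)
  entries = []
  chunk.msgpack_each { |time, record| entries << [time, record] }

  response = client.write_entries(entries)   # hypothetical bulk API call
  raise "chunk contained rejected entries" if response.partial_failure?
end
```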
Is there a known way to do partial retries correctly with the built-in support?
Alternatively, we are thinking about writing a new filter plugin to handle this case, or digging into the core code and contributing to / maintaining a Fluentd fork with partial retries.
We'd like a sanity check before we go too far down an unnecessary path.
Thanks,
Ling