
[Question] Partial Retries Support #1911

Open
qingling128 opened this issue Mar 27, 2018 · 12 comments
Labels
enhancement (Feature request or improve operations), v1

Comments

@qingling128

Hi,

Any chance there is a way to do partial retries in Fluentd? Right now we maintain an output plugin, https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud. For each buffer chunk the plugin processes, we might have some "good" entries and some "bad" entries. We'd like to enable users to flush the "good" entries right away but retry the "bad" entries later.

In the current implementation, it seems that an output plugin can only throw an exception to indicate that a chunk was not processed and should be retried. The problem is that we then either block the "good" entries as well, or we retry the "good" entries again and generate duplicate logs.

Is there a known way to do partial retries correctly with the built-in support?

Alternatively, we are thinking about writing a new filter plugin to handle this case, or digging into the core code and contributing to / maintaining a Fluentd fork with partial retries.

We'd like a sanity check before we go too far down an unnecessary path.

Thanks,
Ling

@repeatedly
Member

Sorry for the delayed response.

What does 'bad' mean? I assume the bad entries can't be flushed to the destination, right?
If so, you can use emit_error_event to send bad entries to the @ERROR label: https://docs.fluentd.org/v1.0/articles/api-plugin-helper-event_emitter

For bad chunks, we will add a backup feature to avoid useless retries: #1856
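For reference, here is a minimal sketch (not the actual google-cloud plugin code) of what routing un-sendable entries to the @ERROR label from an output plugin's write method could look like; send_to_destination is a hypothetical delivery call and the plugin name is made up:

```ruby
require 'fluent/plugin/output'

module Fluent::Plugin
  class PartialExampleOutput < Output
    Fluent::Plugin.register_output('partial_example', self)
    helpers :event_emitter  # provides router for emit_error_event

    def write(chunk)
      tag = chunk.metadata.tag  # assumes tag is one of the buffer chunk keys
      chunk.msgpack_each do |time, record|
        begin
          send_to_destination(record)  # hypothetical per-record delivery call
        rescue => e
          # Bad entry: route it to the @ERROR label instead of failing the whole chunk.
          router.emit_error_event(tag, time, record, e)
        end
      end
    end
  end
end
```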

@repeatedly added the "feature request *Deprecated Label* Use enhancement label in general" and "v1" labels on Apr 3, 2018
@qingling128
Author

The "bad" entries here (which can be flushed to the destination at a later time) are not due to client errors but to temporary server-side errors. We would like to retry them later instead of dropping them.

It seems like we can use router.emit to retry them; we just need to figure out a way to back off for a period of time before re-emitting them. router.emit does not seem to have a retry_wait parameter. Any idea whether that's possible with the existing support? Otherwise we'll come up with something on our own.
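For what it's worth, the kind of workaround we have in mind (tag prefix, output type, and intervals below are hypothetical placeholders) is to re-emit the deferred entries under a prefixed tag, e.g. router.emit("retry." + tag, time, record), and let a separate <match> introduce the delay through its buffer settings:

```
# Hypothetical sketch: deferred entries re-emitted with a "retry." tag prefix
# are buffered by a separate <match>, whose flush/retry settings act as the
# delay we are looking for.
<match retry.**>
  @type google_cloud            # placeholder: whatever output retries the deferred entries
  <buffer>
    flush_interval 30s          # wait before the deferred entries are attempted again
    retry_wait 30s
    retry_max_times 10
  </buffer>
</match>
```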

@repeatedly
Member

The "bad" entries here (which can be flushed to the destination at a later time) are not due to client errors but to temporary server-side errors.

Currently, we have a default retry mechanism for that. Hmm... I didn't understand the situation correctly.
There are several failure situations:

  • Temporary failure: server-side outage or network failure. We can use the default retry mechanism.
  • Invalid events: the log event has an invalid format, e.g. MongoDB's field-name limitations. We can use router.emit_error_event to rescue it and send the normal events to the destination.
  • Unrecoverable error: schema mismatch or something similar. Use backup/secondary instead of retrying. (See the config sketch after this comment.)

I assume your problem is the temporary-failure case, but your pipeline contains good and bad entries in one buffer chunk, right?
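Roughly, the three cases map onto configuration like this sketch (destinations, hostnames, and paths are placeholders, not from this thread):

```
<label @ERROR>
  # Invalid events sent via router.emit_error_event land here.
  <match **>
    @type file
    path /var/log/fluentd/error_events
  </match>
</label>

<match app.**>
  @type forward                  # temporary failures rely on this output's default retry
  <server>
    host backend.example.com
  </server>
  <secondary>
    # Chunks that exhaust retries (unrecoverable errors) are backed up here.
    @type secondary_file
    directory /var/log/fluentd/backup
  </secondary>
</match>
```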

@qingling128
Author

@repeatedly - Yes! Our problem is the temporary-failure case, but with ready-to-go and failed-but-need-to-be-retried entries in the same buffer chunk. By throwing an exception we can retry everything, but instead we want to retry only the failed-but-need-to-be-retried ones.

@repeatedly
Member

I see... now I understand the situation.
Currently, output plugins don't support such a retry/rollback feature.
How do you check whether an entry is ready-to-go or failed-but-needs-to-be-retried? Do you try to write first, and the response contains a list of error events?

@qingling128
Author

@repeatedly - We have a local binary running that provides metadata for log entries based on a label. Our Fluentd output plugin makes one request per log entry if the metadata for that label is not cached locally. Depending on whether the metadata is available, we might choose to retry a subset of log entries.

The reason behind this:
We are using one Fluentd instance to send logs for multiple containers. There is a short time gap (under one minute) between when a container starts running and when the metadata for that container becomes available. For any logs emitted by the container during this gap, we want to buffer the logs until later. But we don't want to buffer them in the same chunk as logs from other "mature" containers that already have metadata available.
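A rough sketch of the split-and-defer approach we are considering (hypothetical helper names, not our actual plugin code):

```ruby
# Inside our output plugin class (sketch only).
def write(chunk)
  tag = chunk.metadata.tag
  ready = []

  chunk.msgpack_each do |time, record|
    if metadata_cached?(record['local_resource_id'])  # hypothetical cache lookup
      ready << [time, record]
    else
      # Metadata not available yet: push the entry back under a separate tag
      # so it is buffered apart from the "mature" containers' entries.
      router.emit("retry.#{tag}", time, record)
    end
  end

  send_batch(ready) unless ready.empty?  # hypothetical batched Logging API call
end
```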

@cosmo0920
Contributor

@portante and his co-workers reported a similar issue.

Below is his explanation:

For example, there might be 100 records in a chunk, and 13 of them are bad. We want to have the ES plugin, or any other output plugin, send off the 87 that are good, but return the 13 that are bad.

ref: uken/fluent-plugin-elasticsearch#398 (comment)

It was originally reported in Red Hat's Bugzilla.

@repeatedly
Member

repeatedly commented Apr 16, 2018

@qingling128 I see. Does your record have a specific field for your metadata?
In v1, Fluentd decides the target chunk by chunk keys, <buffer CHUNK_KEYS>.
So if one or more fields refer to the corresponding metadata, good and bad chunks are separated before the flush.
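For example, a sketch of that chunk-key idea (the match pattern and interval are placeholders, assuming the output is registered as google_cloud):

```
# Adding local_resource_id to the chunk keys puts entries with different
# metadata into different chunks, so a "bad" resource never blocks a "good" one.
<match kubernetes.**>
  @type google_cloud
  <buffer tag, local_resource_id>
    flush_interval 5s
  </buffer>
</match>
```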

@cosmo0920 I saw the patch. Why doesn't the @ERROR label fit this case?
DLQ seems like a specific version of the @ERROR label.

@cosmo0920
Contributor

cosmo0920 commented Apr 17, 2018

@cosmo0920 I saw the patch. Why doesn't the @ERROR label fit this case?
DLQ seems like a specific version of the @ERROR label.

According to Red Hat's Bugzilla, it seems that there is a performance issue.
In the ES plugin's case, the UTF-8 encoding conversion is too heavy to handle before fluent-plugin-elasticsearch sends the data.

@jcantrill

@repeatedly @cosmo0920 As I stated in the original issue, it was not necessarily a performance issue but rather a problem with the way chunks were consumed by fluent-plugin-elasticsearch. If any part of deserializing and processing the records in a chunk failed, an exception would bubble up and the chunk would be retried indefinitely; there was no way to get past the bad records. Do we have an example of using the router with the error label that would negate the need for #1948, so I could close it?

@qingling128
Author

@repeatedly - Yes, our records do have a specific field, local_resource_id, that we rely on to retrieve metadata.

The problem with deciding the chunk by local_resource_id is that it would force us to flush log entries (this happens after retrieving the metadata locally) into the Logging API (via HTTP/RPC requests) more slowly, because it results in separate requests for each local_resource_id.

In a container world, this means that if a Fluentd instance is handling logs from 50 containers, the number of requests we make to the Logging API will be 50 × the original request count. Previously, we could bundle log entries from 50 containers into one Logging API request; now we would have to make 50 requests (from 50 flushes). This would slow us down significantly.

@qingling128
Author

Hi, any news about this feature request?

@kenhys added the "enhancement (Feature request or improve operations)" label and removed the deprecated "feature request" label on Jul 31, 2024