
@num_errors is not zero cleared after a successful retry. #1379

Closed
shuji-koike opened this issue Dec 16, 2016 · 14 comments
Labels
bug (Something isn't working) · pending (To be done in the future) · v0.14

Comments

@shuji-koike
Contributor

shuji-koike commented Dec 16, 2016

I'm trying to monitor retry_count with the monitor_agent plugin, and it was working as expected in 0.12.x.
However, in 0.14.x retry_count behaves differently: it just seems to count up and is not zero-cleared after a successful flush.

retry_count is actually the value of @num_errors in output plugins.
https://github.com/fluent/fluentd/blob/be7ffb2/lib/fluent/plugin/in_monitor_agent.rb#L259
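
For context, this is roughly how we poll the value (a minimal sketch, assuming in_monitor_agent is running with its default port 24220 and the /api/plugins.json endpoint):

```ruby
require 'net/http'
require 'json'

# Fetch plugin metrics from in_monitor_agent (default port/endpoint assumed).
body = Net::HTTP.get(URI('http://localhost:24220/api/plugins.json'))
JSON.parse(body)['plugins'].each do |plugin|
  # retry_count is the value in_monitor_agent reads from the output plugin's @num_errors.
  puts "#{plugin['plugin_id']}: retry_count=#{plugin['retry_count']}"
end
```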

In the 0.12 code, @num_errors is zero-cleared after a successful retry.
https://github.com/fluent/fluentd/blob/v0.12.31/lib/fluent/output.rb#L353
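
Roughly, the v0.12 success path does something like this (a paraphrase, not the exact code at the link above):

```ruby
# Paraphrased v0.12 behaviour: once a flush succeeds after earlier failures,
# the error counter is reset, so the retry_count reported by in_monitor_agent
# drops back to 0.
if @num_errors > 0
  log.warn "retry succeeded."
  @num_errors = 0   # <- the zero-clear that 0.14 no longer performs
end
```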

In 0.14, there is no corresponding implementation.
https://github.com/fluent/fluentd/blob/be7ffb2/lib/fluent/plugin/output.rb

Is this the intended new behavior of @num_errors in 0.14, or should we treat it as a bug or a TODO to be fixed?

@tagomoris
Member

I changed @num_errors to be the total count of output errors, the same as @emit_count (the total count of emit calls, even in v0.12). But I missed that @num_errors is referred to by monitor_agent.
There is a way to show the count of retries while failing in v0.14: @retry.steps. Should in_monitor_agent refer to this value instead?
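
Just to illustrate the idea (not a decided change), the in_monitor_agent side could read the retry state roughly like this, where `plugin` stands for the output plugin instance and @retry is assumed to be nil while the plugin is healthy:

```ruby
# Illustrative sketch only: report the in-flight retry streak instead of the
# lifetime error total. @retry is assumed to be nil when nothing is failing.
retry_state = plugin.instance_variable_get(:@retry)
retry_count = retry_state ? retry_state.steps : 0
```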

@CSharpRU
Contributor

Hello, right now you can check the real retry count for each plugin by installing fluentd from the master branch. See #1387

@tagomoris added the pending label on Dec 26, 2016
@shuji-koike
Contributor Author

Hi! @CSharpRU
We have updated our fluentd to 0.14.11 and started using retry.steps for monitoring 🎉
Thanks for your work!

@shuji-koike
Contributor Author

shuji-koike commented Jan 13, 2017

@tagomoris san
Should I close this issue? Does the pending label mean something?

@tagomoris
Member

I'm wondering whether we should restore @num_errors to its previous counting behavior for compatibility reasons, and I haven't gotten enough comments/feedback on it. That's why I set the pending label here.
@shuji-koike, if you think there's no problem with the latest release, please close this yourself.

@shuji-koike
Contributor Author

shuji-koike commented Jan 13, 2017

I'm not sure exactly when the behavior of @num_errors and retry_count changed within the 0.14.x series, but
IMHO, breaking compatibility among (patch release versions of) 0.14.x should also be a concern.

I'd lean toward +1 for not re-fixing @num_errors (retry_count) and instead adding a note to the documentation.
Compatibility between 0.12 and 0.14 will break, but the new behavior of retry_count may be useful in some cases.

I'd also appreciate more comments/feedback.

@shuji-koike
Contributor Author

P.S.
My main concern is retry_count's behavior, and I have no strong opinion on how (or whether) @num_errors itself should be fixed.
As @tagomoris san said, re-fixing @num_errors may be the better option considering its overall usage, but I have no clue 😝

@CSharpRU
Contributor

I think that the overall number of errors is useful too.

@wimnat

wimnat commented Jan 9, 2018

My 2 cents: the Datadog integration breaks because it uses the monitor agent. You can't recover an alert on retries because the metric never goes back to 0.

@shivamdixit

I'm facing the same issue as mentioned by @wimnat. The alerts never recover because the metric never goes back to 0, and the only option is to restart the process.

Is there a reliable metric to know whether there are errors at any given moment?

PS: If retry_count == num_errors, what's the purpose of having two different metrics for the same thing?

@repeatedly
Member

I'm not familiar with the Datadog integration, but if it uses in_monitor_agent's output directly, it should be fixed.

https://docs.fluentd.org/v1.0/articles/in_monitor_agent#in-retry

They can use the ["retry"]["steps"] field if it exists.
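
For example, a monitoring script can prefer the in-flight retry information and treat a missing "retry" section as healthy (a sketch; it assumes the default in_monitor_agent port and endpoint):

```ruby
require 'net/http'
require 'json'

plugins = JSON.parse(Net::HTTP.get(URI('http://localhost:24220/api/plugins.json')))['plugins']
plugins.each do |plugin|
  # The "retry" section is only present while the plugin is actually retrying.
  steps = plugin['retry'] ? plugin['retry']['steps'] : 0
  puts "#{plugin['plugin_id']}: current retry steps=#{steps}"
end
```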

> PS: If retry_count == num_errors, what's the purpose of having two different metrics for the same thing?

The internal code uses the name num_errors, and in_monitor_agent exposes it as the retry_count field. They are not two different metrics.

@repeatedly
Member

BTW, this issue is old and we are going with the current implementation, so I'm closing it.

@repeatedly
Member

No one sent a patch to Datadog, so I sent one.

DataDog/integrations-core#2965

@jurim76

jurim76 commented Sep 4, 2020

Same issue with the elasticsearch plugin (td-agent 3.7.1). "retry_count" is not resetting to zero after a successful retry, so the monitoring system keeps throwing an alert (trigger alert if retry_count > 0). A manual service restart is not a "production" solution, imho.
