
@num_errors is not zero cleared after a successful retry. #1379

Closed
shuji-koike opened this issue Dec 16, 2016 · 14 comments
Labels
bug (Something isn't working) · pending (To be done in the future) · v0.14

Comments

@shuji-koike
Contributor

shuji-koike commented Dec 16, 2016

I'm trying to monitor retry_count with the monitor_agent plugin, and it was working as expected in 0.12.x.
However, in 0.14.x retry_count behaves differently: it just seems to count up and is not zero-cleared after a successful flush.

retry_count is actually the value of @num_errors in output plugins.
https://github.com/fluent/fluentd/blob/be7ffb2/lib/fluent/plugin/in_monitor_agent.rb#L259
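
For context, this is roughly how we poll the value (a minimal sketch, assuming in_monitor_agent is running with its default port 24220 and the /api/plugins.json endpoint):

```ruby
require 'net/http'
require 'json'

# Fetch plugin metrics from in_monitor_agent (default port/endpoint assumed).
body = Net::HTTP.get(URI('http://localhost:24220/api/plugins.json'))
JSON.parse(body)['plugins'].each do |plugin|
  # retry_count is the value in_monitor_agent reads from the output plugin's @num_errors.
  puts "#{plugin['plugin_id']}: retry_count=#{plugin['retry_count']}"
end
```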

In the 0.12 code, @num_errors is zero-cleared after a successful retry.
https://github.com/fluent/fluentd/blob/v0.12.31/lib/fluent/output.rb#L353
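
Roughly, the v0.12 success path does something like this (a paraphrase, not the exact code at the link above):

```ruby
# Paraphrased v0.12 behaviour: once a flush succeeds after earlier failures,
# the error counter is reset, so the retry_count reported by in_monitor_agent
# drops back to 0.
if @num_errors > 0
  log.warn "retry succeeded."
  @num_errors = 0   # <- the zero-clear that 0.14 no longer performs
end
```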

In 0.14, there is no corresponding implementation.
https://github.com/fluent/fluentd/blob/be7ffb2/lib/fluent/plugin/output.rb

Is this the intended new behavior of @num_errors in 0.14, or should we treat it as a bug or a TODO to be fixed?

@tagomoris
Member

I changed @num_errors to be the total count of output errors, the same as @emit_count (the total count of emit calls, even in v0.12). But I missed that @num_errors is referred to by monitor_agent.
There is a way to show the count of retries while failing in v0.14: @retry.steps. Should in_monitor_agent refer to this value instead?
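
Just to illustrate the idea (not a decided change), the in_monitor_agent side could read the retry state roughly like this, where `plugin` stands for the output plugin instance and @retry is assumed to be nil while the plugin is healthy:

```ruby
# Illustrative sketch only: report the in-flight retry streak instead of the
# lifetime error total. @retry is assumed to be nil when nothing is failing.
retry_state = plugin.instance_variable_get(:@retry)
retry_count = retry_state ? retry_state.steps : 0
```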

@CSharpRU
Contributor

Hello, right now you can check the real retry count for each plugin by installing fluentd from the master branch. See #1387

@tagomoris added the pending label on Dec 26, 2016
@shuji-koike
Contributor Author

Hi! @CSharpRU
We have updated our fluentd to 0.14.11 and started using retry.steps for monitoring 🎉
Thanks for your work!

@shuji-koike
Contributor Author

shuji-koike commented Jan 13, 2017

@tagomoris san
Should I close this issue? Does the pending label mean something?

@tagomoris
Member

I'm wondering whether we should restore @num_errors to its previous counting behavior for compatibility reasons, and I haven't gotten enough comments/feedback on it. That's why I set the pending label here.
@shuji-koike, if you think there's no problem with the latest release, please close this yourself.

@shuji-koike
Contributor Author

shuji-koike commented Jan 13, 2017

I'm not sure exactly when the behavior of @num_errors and retry_count changed within the 0.14.x series, but
IMHO, breaking compatibility among (patch release versions of) 0.14.x should also be a concern.

I'd lean toward +1 for not re-fixing @num_errors (retry_count) and instead adding a note to the documentation.
Compatibility between 0.12 and 0.14 will break, but the new behavior of retry_count may be useful in some cases.

I'd also appreciate more comments/feedback.

@shuji-koike
Contributor Author

P.S.
My main concern is retry_count's behavior, and I have no strong opinion on how (or whether) @num_errors itself should be fixed.
As @tagomoris san said, re-fixing @num_errors may be the better option considering its overall usage, but I have no clue 😝

@CSharpRU
Contributor

I think that the overall number of errors is useful too.

@wimnat

wimnat commented Jan 9, 2018

My 2 cents: the Datadog integration breaks because it uses the monitor agent. You can't recover an alert on retries because the metric never goes back to 0.

@shivamdixit

I'm facing the same issue as mentioned by @wimnat. The alerts never recover because the metric never goes back to 0, and the only option is to restart the process.

Is there a reliable metric to know whether there are errors at any given moment?

PS: If retry_count == num_errors, what's the purpose of having two different metrics for the same thing?

@repeatedly
Member

I'm not familiar with the Datadog integration, but if it uses in_monitor_agent's output directly, it should be fixed.

https://docs.fluentd.org/v1.0/articles/in_monitor_agent#in-retry

They can use the ["retry"]["steps"] field if it exists.
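
For example, a monitoring script can prefer the in-flight retry information and treat a missing "retry" section as healthy (a sketch; it assumes the default in_monitor_agent port and endpoint):

```ruby
require 'net/http'
require 'json'

plugins = JSON.parse(Net::HTTP.get(URI('http://localhost:24220/api/plugins.json')))['plugins']
plugins.each do |plugin|
  # The "retry" section is only present while the plugin is actually retrying.
  steps = plugin['retry'] ? plugin['retry']['steps'] : 0
  puts "#{plugin['plugin_id']}: current retry steps=#{steps}"
end
```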

> PS: If retry_count == num_errors, what's the purpose of having two different metrics for the same thing?

The internal code uses the name num_errors, and in_monitor_agent exposes it as the retry_count field. They are not two different metrics.

@repeatedly
Member

BTW, this issue is old and we are going with the current implementation, so I'm closing it.

@repeatedly
Member

No one sent a patch to Datadog, so I sent one.

DataDog/integrations-core#2965

@jurim76

jurim76 commented Sep 4, 2020

Same issue with the elasticsearch plugin (td-agent 3.7.1). "retry_count" is not resetting to zero after a successful retry, so the monitoring system keeps throwing an alert (trigger alert if retry_count > 0). A manual service restart is not a "production" solution, imho.
