Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increment gather_errors for all errors emitted by inputs #2339

Closed
wants to merge 1 commit into from
Closed

Increment gather_errors for all errors emitted by inputs #2339

wants to merge 1 commit into from

Conversation

cosmopetrich
Copy link

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

I'm not sure if the change made here is the correct way to go about this, but I figured a PR would be better way to start a discussion than an issue or groups post.

Most Telegraf input plugins don't currently seem to provide a metric that can be used to determine if the gather operation is running successfully. For example, the prometheus input plugin logs an error if given a target address that returns HTTP 401, but won't return any metrics. That makes it difficult to tell whether a particular Prometheus client is "up".

I'd thought of deploying http_response inputs alongside httpjson/prometheus inputs, but that's a fair bit of extra configuration and doesn't handle endpoints that return 200 but with invalid data, etc. Adding a new metric to the prometheus input was another option but it looks like some other plugins behave similarly and it would be nice to have a more generic solution.

Telegraf 1.2 added the internal plugin, which exposes an internal_agent_gather_errors metric. That seems like a reasonable thing to monitor, however as far as I can tell it's only incremented by the SNMP plugin. This PR aims to increment the metric whenever any input emits an error. This won't catch all errors, as quite a number of plugins handle and log errors internally. I can update those too, but that's probably best in a separate PR.

@@ -157,7 +157,7 @@ func gatherWithTimeout(
select {
case err := <-done:
if err != nil {
log.Printf("E! ERROR in input [%s]: %s", input.Name(), err)
acc.AddError(err)
}
return
case <-ticker.C:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure whether it made sense to invoke AddError on "took longer than collection interval" errors.

@sparrc sparrc added this to the 1.3.0 milestone Jan 30, 2017
@sparrc
Copy link
Contributor

sparrc commented Feb 2, 2017

this certainly seems like a good idea to me, what do you think @phemmer?

we should probably document that plugins should never call acc.AddError and return the error

@sparrc
Copy link
Contributor

sparrc commented Feb 2, 2017

@cosmopetrich feel free to update the changelog

@phemmer
Copy link
Contributor

phemmer commented Feb 2, 2017

I'm for it.
But I think we should adjust line 164 as well.

@sparrc
Copy link
Contributor

sparrc commented Feb 3, 2017

done

@sparrc sparrc closed this in b1945c0 Feb 3, 2017
maxunt pushed a commit that referenced this pull request Jun 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants