Reloading telegraf prevents prometheus output from updating #2282

Closed
cosmopetrich opened this issue Jan 18, 2017 · 3 comments · Fixed by #2309

Labels
bug unexpected problem or unintended behavior

cosmopetrich commented Jan 18, 2017

Bug report

Reloading telegraf via a SIGHUP appears to prevent the output from the prometheus_client plugin from updating, even though telegraf itself continues to collect metrics as normal.

Relevant telegraf.conf:

The config below can be used to replicate the issue. The same symptoms occur when using 'real' metrics (CPU, mem, etc).

[agent]
  interval = "5s"
  round_interval = true
  metric_buffer_limit = 1000
  flush_buffer_when_full = true
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = true
  quiet = false

[[outputs.prometheus_client]]
  listen = ":9126"

[[inputs.exec]]
  commands = ['/bin/bash -c "echo testmetric $(date +%s) $(date +%s)"']
  data_format = "graphite"

System info:

Telegraf 1.1.1 & 1.1.2 on Ubuntu 14.04.

I built 1.2 from source to see if the changes to prometheus_client caching would have any impact, but the issue still appears to be present there.

Steps to reproduce:

  1. Start telegraf with the config above, then hit the prometheus endpoint periodically.

    date +%s && curl -s localhost:9126/metrics | grep ^testmetric

    Telegraf's metric will be within a couple of seconds of the real time.

  2. Reload telegraf.

    date +%s && service telegraf reload

  3. Start hitting the prometheus endpoint again. The testmetric value will remain as it was before the reload, even hours later.

    Telegraf itself will still report the correct value when called with -test.

    date +%s && telegraf -test | grep testmetric
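
The manual check in the steps above can be scripted. The following is a minimal sketch rather than part of the original report: `metric_age` and `is_stale` are hypothetical helper names, and the parsing assumes the endpoint exposes a line starting with `testmetric` whose last whitespace-separated field is the sample value (a Unix timestamp here), with no trailing Prometheus timestamp column.

```shell
#!/bin/sh

# metric_age NOW
# Reads Prometheus text exposition on stdin and prints how many seconds the
# first `testmetric` sample lags behind NOW (a Unix timestamp).
# Exits non-zero if no such sample is found.
metric_age() {
    awk -v now="$1" '
        /^testmetric/ {
            # Adding 0 forces numeric conversion, which also covers
            # exponent notation such as 1.484701515e+09.
            printf "%d\n", now - ($NF + 0)
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}

# is_stale NOW MAX_AGE
# Succeeds (exit 0) when the sample on stdin is older than MAX_AGE seconds,
# returns 1 when it is fresh, and 2 when the metric is missing.
is_stale() {
    age=$(metric_age "$1") || return 2
    [ "$age" -gt "$2" ]
}
```

Assuming the listener from the config above, something like `curl -s localhost:9126/metrics | is_stale "$(date +%s)" 30 && echo "testmetric is stale"` could be run before and after the reload. The 30-second threshold is an arbitrary choice for a 5s collection interval and a 10s flush interval.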

Additional info:

There were previously reload-related issues with the prometheus_client plugin that were fixed in #1753.

Here are some sample debug log entries from before and after a reload.

2017/01/18 01:05:15 I! Output [prometheus_client] buffer fullness: 2 / 1000 metrics. Total gathered metrics: 763. Total dropped metrics: 0.
2017/01/18 01:05:15 I! Output [prometheus_client] wrote batch of 2 metrics in 94.555µs
2017/01/18 01:05:17 I! Reloading Telegraf config
2017/01/18 01:05:17 I! Hang on, flushing any cached metrics before shutdown
2017/01/18 01:05:17 I! Output [prometheus_client] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 763. Total dropped metrics: 0.
2017/01/18 01:05:17 D! Attempting connection to output: prometheus_client
2017/01/18 01:05:17 D! Successfully connected to output: prometheus_client
2017/01/18 01:05:17 I! Starting Telegraf (version 1.1.2)
2017/01/18 01:05:17 I! Loaded outputs: prometheus_client
2017/01/18 01:05:17 I! Loaded inputs: inputs.exec
2017/01/18 01:05:17 I! Tags enabled: host=XXX
2017/01/18 01:05:17 I! Agent Config: Interval:5s, Quiet:false, Hostname:"XXX", Flush Interval:10s
2017/01/18 01:05:20 D! Input [inputs.exec] gathered metrics, (5s interval) in 4.174276ms
2017/01/18 01:05:25 D! Input [inputs.exec] gathered metrics, (5s interval) in 4.201608ms
2017/01/18 01:05:30 D! Input [inputs.exec] gathered metrics, (5s interval) in 4.124453ms
2017/01/18 01:05:30 I! Output [prometheus_client] buffer fullness: 3 / 1000 metrics. Total gathered metrics: 3. Total dropped metrics: 0.
2017/01/18 01:05:30 I! Output [prometheus_client] wrote batch of 3 metrics in 107.546µs
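
In the log above, the output still reports a written batch after the reload (`wrote batch of 3 metrics`), so the log alone does not reveal the stale endpoint. As a small aid when scanning longer logs, the sketch below counts such lines after the most recent reload marker. `batches_after_reload` is a hypothetical helper name, and the matched strings are copied from the sample log above; other Telegraf versions may word them differently.

```shell
#!/bin/sh

# batches_after_reload
# Reads a Telegraf log on stdin and prints how many prometheus_client
# "wrote batch" lines appear after the last "Reloading Telegraf config" line.
# Prints 0 when the log contains no reload marker.
batches_after_reload() {
    awk '
        /Reloading Telegraf config/ { count = 0; seen = 1; next }
        seen && /Output \[prometheus_client\] wrote batch/ { count++ }
        END { print count + 0 }
    '
}
```

A non-zero count together with a stale value on the metrics endpoint would match the symptom described in this report.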

cosmopetrich commented Jan 18, 2017

Here's an example from prometheus. This shows a few hours of cpu_idle data for a number of telegraf instances which received a config update via puppet that triggered a reload.

[Screenshot: Prometheus graph of cpu_idle, 2017-01-18]

sparrc added the bug label Jan 20, 2017
sparrc added a commit that referenced this issue Jan 21, 2017

sparrc commented Jan 24, 2017

I've got a PR to fix this, but it requires Go version 1.8: #2309

As soon as Go 1.8 is released, I will get that merged into master, and it will be ready for the 1.3 telegraf release, or possibly a 1.2.1+ telegraf release.

sparrc added this to the 1.3.0 milestone Jan 24, 2017
sparrc added a commit that referenced this issue Jan 24, 2017
sparrc added a commit that referenced this issue Jan 24, 2017
cosmopetrich commented

Great, thanks for the fast response!

sparrc added a commit that referenced this issue Feb 16, 2017
sparrc added a commit that referenced this issue Feb 16, 2017
maxunt pushed a commit that referenced this issue Jun 26, 2018