textfile: the same metric occurring with different labels in two different prom files trigger error due to inconsistent help message in processFile #1885

Closed
zmedico opened this issue Nov 14, 2020 · 14 comments · Fixed by #2475



zmedico commented Nov 14, 2020

Host operating system: output of uname -a

Linux hostname 5.4.65 #1 SMP Tue Sep 15 21:22:36 UTC 2020 x86_64 AMD EPYC 7551P 32-Core Processor AuthenticAMD GNU/Linux

node_exporter version: output of node_exporter --version

# node_exporter --version
node_exporter, version 1.0.1 (branch: HEAD, revision: 3715be6ae899f2a9b9dbfd9c39f3e09a7bd4559f)
  build user:       root@1f76dbbcfa55
  build date:       20200616-12:44:12
  go version:       go1.14.4

node_exporter command line flags

node_exporter -- --collector.textfile.directory=/var/lib/node_exporter/ --collector.bonding --collector.buddyinfo --collector.ntp --no-collector.infiniband --no-collector.nfs --no-collector.nfsd --no-collector.wifi --no-collector.zfs --web.listen-address=127.0.0.1:9100

Are you running node_exporter in Docker?

No.

What did you do that produced an error?

Write two different prom files to the textfile collector directory, containing metrics with the same names but different labels. For example:

$ cat /var/lib/node_exporter/app-instance-1.prom
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
$ cat /var/lib/node_exporter/app-instance-2.prom
# TYPE app_metric_name counter
app_metric_name{instance="2"} 0

What did you expect to see?

Expected metrics with the same name but different labels to be collected from both prom files.

What did you see instead?

Metrics were successfully collected from the first prom file, while metrics from the second prom file were rejected by the prometheus.checkMetricConsistency function at prometheus/registry.go#L914.
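
The same consistency failure can be reproduced outside node_exporter with a small client_golang program (a hypothetical sketch; the collector type and values are made up to mirror the two files above):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// inconsistentCollector is a hypothetical collector that emits two samples of
// the same metric name whose Desc objects carry different help strings,
// mimicking what the textfile collector does when it derives the default
// help text from each file's path.
type inconsistentCollector struct{}

func (inconsistentCollector) Describe(chan<- *prometheus.Desc) {}

func (inconsistentCollector) Collect(ch chan<- prometheus.Metric) {
	d1 := prometheus.NewDesc("app_metric_name",
		"Metric read from /var/lib/node_exporter/app-instance-1.prom",
		nil, prometheus.Labels{"instance": "1"})
	d2 := prometheus.NewDesc("app_metric_name",
		"Metric read from /var/lib/node_exporter/app-instance-2.prom",
		nil, prometheus.Labels{"instance": "2"})
	ch <- prometheus.MustNewConstMetric(d1, prometheus.CounterValue, 0)
	ch <- prometheus.MustNewConstMetric(d2, prometheus.CounterValue, 0)
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(inconsistentCollector{})
	// Gather applies the registry's consistency checks and returns the same
	// "has help ... but should have ..." error quoted below.
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}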

The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:

diff --git a/collector/textfile.go b/collector/textfile.go
index 50c1807..f464d44 100644
--- a/collector/textfile.go
+++ b/collector/textfile.go
@@ -254,7 +254,7 @@ func (c *textFileCollector) processFile(name string, ch chan<- prometheus.Metric
 
 	for _, mf := range families {
 		if mf.Help == nil {
-			help := fmt.Sprintf("Metric read from %s", path)
+			help := fmt.Sprintf("Metric read from %s", c.path)
 			mf.Help = &help
 		}
 	}

The error was not logged by default, but I was able to log it with the patch below, which reports:
* [from Gatherer #2] collected metric app_metric_name label:<name:"instance" value:"2" > counter:<value:0 > has help "Metric read from /var/lib/node_exporter/app-instance-2.prom" but should have "Metric read from /var/lib/node_exporter/app-instance-1.prom"

--- node_exporter-1.0.1/node_exporter.go
+++ node_exporter-1.0.1/node_exporter.go
@@ -31,6 +31,7 @@
 	"github.com/prometheus/node_exporter/collector"
 	"github.com/prometheus/node_exporter/https"
 	kingpin "gopkg.in/alecthomas/kingpin.v2"
+	stdlog "log"
 )
 
 // handler wraps an unfiltered http.Handler but uses a filtered handler,
@@ -121,6 +122,7 @@
 	handler := promhttp.HandlerFor(
 		prometheus.Gatherers{h.exporterMetricsRegistry, r},
 		promhttp.HandlerOpts{
+			ErrorLog:            stdlog.New(os.Stderr, "", stdlog.LstdFlags),
 			ErrorHandling:       promhttp.ContinueOnError,
 			MaxRequestsInFlight: h.maxRequests,
 			Registry:            h.exporterMetricsRegistry,
zmedico changed the title from "textfile: the same metric occurring with different labels in two different prom files due to inconsistent help message in processFile" to "textfile: the same metric occurring with different labels in two different prom files trigger error due to inconsistent help message in processFile" on Nov 14, 2020

zmedico commented Nov 14, 2020

Looks like this is the same root cause as #1600.

discordianfish commented:

> The inconsistency can be avoided by replacing the full path of the prom file with the parent directory name, like this:

Might be a good option. @SuperQ wdyt? And maybe @juliusv has an opinion on this.

brian-brazil commented:

IIRC, @SuperQ previously said that the use case of having the same metric in two files was explicitly not supported.

The directory name wouldn't be much use, as it'd be the same across all metrics, so we'd no longer be providing a default value that helps you figure out which metric came from which file.

discordianfish commented:

Makes sense, but I'd still like to support the use case of having two separate processes report the same metric via the textfile collector. If they had to write to the same metrics file, they would need to coordinate/merge their state somehow, which seems annoying. Any alternative suggestions?


juliusv commented Nov 24, 2020

If someone wants to contribute code for this, the best solution to the problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment, saying something like Metric read from <file-1> and <file-2>.
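
A rough sketch of what that grouping could look like (a hypothetical helper, not existing node_exporter code; it assumes the per-file family maps that processFile parses are collected into a single map keyed by file path, and it ignores the case where files disagree on an explicit HELP):

package collector

import (
	"fmt"
	"sort"
	"strings"

	dto "github.com/prometheus/client_model/go"
)

// mergeFamilies is a hypothetical helper: it combines metric families that
// share a name across several .prom files into one family and, where no
// explicit HELP was given, synthesizes a single default help string listing
// every file the metric was read from.
func mergeFamilies(perFile map[string]map[string]*dto.MetricFamily) map[string]*dto.MetricFamily {
	merged := map[string]*dto.MetricFamily{}
	sources := map[string][]string{}

	for path, families := range perFile {
		for name, mf := range families {
			if existing, ok := merged[name]; ok {
				// Same metric name seen in another file: append its samples.
				existing.Metric = append(existing.Metric, mf.Metric...)
			} else {
				merged[name] = mf
			}
			sources[name] = append(sources[name], path)
		}
	}

	for name, mf := range merged {
		if mf.Help == nil {
			// e.g. "Metric read from a.prom and b.prom"
			sort.Strings(sources[name])
			help := fmt.Sprintf("Metric read from %s", strings.Join(sources[name], " and "))
			mf.Help = &help
		}
	}
	return merged
}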


brian-brazil commented Nov 24, 2020

> Any alternative suggestions?

If both files contain the same explicit HELP then this logic doesn't apply, and it'll all work.
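
Applied to the files from the report, that would mean adding an identical HELP line to both; the HELP text itself is arbitrary as long as both files agree, for example:

$ cat /var/lib/node_exporter/app-instance-1.prom
# HELP app_metric_name Per-instance application counter.
# TYPE app_metric_name counter
app_metric_name{instance="1"} 0
$ cat /var/lib/node_exporter/app-instance-2.prom
# HELP app_metric_name Per-instance application counter.
# TYPE app_metric_name counter
app_metric_name{instance="2"} 0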


juliusv commented Nov 24, 2020

This is true, but it would be even better to make things work automatically even if the user forgot to provide their own HELP strings, especially since this is not an error that you notice on startup; for many people it will be a silent error that they don't notice unless they're really on top of their monitoring.

brian-brazil commented:

That's partially due to ContinueOnError, which is a dangerous setting, as it hides failures like this in a non-deterministic way.


juliusv commented Nov 24, 2020

Agreed, the Node Exporter sets ContinueOnError here:

ErrorHandling: promhttp.ContinueOnError,

That seems dangerous, but I'm not deep enough into the Node Exporter to say what kinds of things it would break if we set that to e.g. HTTPErrorOnError instead.
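
For reference, such a change would look roughly like this in the HandlerOpts block quoted above (an untested sketch of the alternative setting, not a proposed patch):

 	handler := promhttp.HandlerFor(
 		prometheus.Gatherers{h.exporterMetricsRegistry, r},
 		promhttp.HandlerOpts{
-			ErrorHandling:       promhttp.ContinueOnError,
+			// Fail the whole scrape with HTTP 500 when any collector errors,
+			// instead of silently dropping the affected metrics.
+			ErrorHandling:       promhttp.HTTPErrorOnError,
 			MaxRequestsInFlight: h.maxRequests,
 			Registry:            h.exporterMetricsRegistry,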

brian-brazil commented:

Basically, if there's a problem, the whole scrape would fail with HTTPErrorOnError, rather than effectively silently failing in a non-deterministic way like it currently does.


juliusv commented Nov 24, 2020

Yes, that much is clear. I was just wondering whether it would be too problematic to change now. Looking into why ContinueOnError was initially introduced into the Node Exporter, it was added in dace41e, explicitly for the textfile collector.


lymweb commented May 31, 2021

Same problem here.

discordianfish commented:

Yeah, this is still an open issue. Feel free to implement Julius' suggestion:

> If someone wants to contribute code for this, the best solution to the problem would seem to be to group identically named metrics from different files into a single output group with just one HELP comment, saying something like Metric read from <file-1> and <file-2>.


wkhere commented Jan 27, 2022

Ouch, I've also hit this and would benefit from such metric merging.

kai11 pushed a commit to kai11/cronmanager that referenced this issue Mar 28, 2022
Instead of a single crons.prom file, each job now has its own. However,
by default node_exporter's textfile collector doesn't merge them and uses
only the first one.

Details in prometheus/node_exporter#1885

Updates:
* instead of a single crons.prom file, each job now has - and locks - its own
* added the same HELP header to all files
* cron_job metric renamed to cronjob
* minor README updates