Service Telemetry Prometheus metrics server does not properly report error #11417
Labels: bug (Something isn't working)

Comments
@omzmarlon first things first, please change to use readers. See the deprecation notice in https://github.com/open-telemetry/opentelemetry-collector/releases/tag/v0.111.0.
github-merge-queue bot pushed a commit that referenced this issue on Jan 27, 2025:
#### Description
Pass the missing async error channel into telemetry.Settings

#### Link to tracking issue
Fixes #11417

#### Testing
With the same setup as in #11417, building a new otelcol with the changes in this PR and running 2 instances with the same config using the same metric port, we would see proper crash error messages:

```
# config used:
receivers:
  nop:
exporters:
  nop:
service:
  pipelines:
    logs:
      receivers:
        - nop
      exporters:
        - nop
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: localhost
                port: 8889
```

```
# first instance log:
./otelcol-custom --config otel-config.yaml
2025-01-16T17:36:34.638-0800	info	[email protected]/service.go:165	Setting up own telemetry...
2025-01-16T17:36:34.638-0800	info	telemetry/metrics.go:70	Serving metrics	{"address": "localhost:8889", "metrics level": "Normal"}
2025-01-16T17:36:34.639-0800	info	[email protected]/service.go:231	Starting otelcol-custom...	{"Version": "", "NumCPU": 16}
2025-01-16T17:36:34.639-0800	info	extensions/extensions.go:39	Starting extensions...
2025-01-16T17:36:34.639-0800	info	[email protected]/service.go:254	Everything is ready. Begin running and processing data.
```

```
# second instance's log (using same config)
./otelcol-custom --config otel-config.yaml
2025-01-16T17:36:37.270-0800	info	[email protected]/service.go:165	Setting up own telemetry...
2025-01-16T17:36:37.270-0800	info	telemetry/metrics.go:70	Serving metrics	{"address": "localhost:8889", "metrics level": "Normal"}
2025-01-16T17:36:37.271-0800	info	[email protected]/service.go:231	Starting otelcol-custom...	{"Version": "", "NumCPU": 16}
2025-01-16T17:36:37.271-0800	info	extensions/extensions.go:39	Starting extensions...
2025-01-16T17:36:37.271-0800	info	[email protected]/service.go:254	Everything is ready. Begin running and processing data.
2025-01-16T17:36:37.273-0800	error	[email protected]/collector.go:325	Asynchronous error received, terminating process	{"error": "listen tcp 127.0.0.1:8889: bind: address already in use"}
go.opentelemetry.io/collector/otelcol.(*Collector).Run
	go.opentelemetry.io/collector/[email protected]/collector.go:325
go.opentelemetry.io/collector/otelcol.NewCommand.func1
	go.opentelemetry.io/collector/[email protected]/command.go:36
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:1041
main.runInteractive
	go.opentelemetry.io/collector/cmd/builder/main.go:49
main.run
	go.opentelemetry.io/collector/cmd/builder/main_others.go:10
main.main
	go.opentelemetry.io/collector/cmd/builder/main.go:42
runtime.main
	runtime/proc.go:272
2025-01-16T17:36:37.273-0800	info	[email protected]/service.go:296	Starting shutdown...
2025-01-16T17:36:37.274-0800	info	extensions/extensions.go:66	Stopping extensions...
2025-01-16T17:36:37.274-0800	info	[email protected]/service.go:310	Shutdown complete.
```

Co-authored-by: Antoine Toulme <[email protected]>
Describe the bug
While using the Prometheus server for `service.telemetry.metrics`, if the Prometheus server cannot start (e.g. the port it uses is already occupied), the OTel framework seems to swallow that error and does not properly report it, so the user/caller cannot do anything about it.

Steps to reproduce
It's pretty easy to reproduce with a very simple setup.

I am using the tag `v0.111.0` (latest available at the time of writing this issue).

First I built an otelcol binary using the OTel Builder tool. For the builder config I used the config in the repo here, with some small adjustments.
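Roughly, the builder manifest looked along these lines (a hypothetical sketch: the dist fields, module choices, and versions here are assumptions, not the reporter's actual file):

```yaml
# Hypothetical builder manifest sketch -- dist fields, nop component choices,
# and module versions are assumptions, not the reporter's actual config.
dist:
  name: otelcol-custom
  output_path: ./otelcol-custom

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/nopreceiver v0.111.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/nopexporter v0.111.0

# The reporter also mentions using replace directives here to point the build
# at a locally modified copy of the collector source, e.g.:
# replaces:
#   - go.opentelemetry.io/collector/service => ../opentelemetry-collector/service
```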
The config used to run the otelcol binary is this:
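For illustration, a config along these lines triggers the behavior. This is the readers-based variant quoted in the testing section of the fix PR above; the original report may have used the deprecated address-style telemetry settings instead, so treat it as an example rather than the verbatim config:

```yaml
receivers:
  nop:

exporters:
  nop:

service:
  pipelines:
    logs:
      receivers:
        - nop
      exporters:
        - nop
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: localhost
                port: 8889
```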
Now if I run 2 instances of the otelcol binary using this same config (i.e. both will try to bind to port 8889 for metrics), the logs will be exactly the same:
Neither instance's logs say anything about port 8889 already being in use.
Since I was using `replace` in the otelcol builder config, I was able to manually insert some `fmt.Printf` calls into the OTel source code to print the error. I basically changed this line to add a print, roughly as sketched below:
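The following is an illustrative sketch only, not the collector's actual code: it shows the shape of a metrics-serving goroutine with an extra `fmt.Printf` added so the bind error becomes visible even when `asyncErrorChannel` is nil. The server setup, channel names, and `main` harness are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// serveMetrics sketches the shape of the metrics-serving goroutine described
// above, with the manually added debug print before the select.
func serveMetrics(server *http.Server, asyncErrorChannel chan error, shutdown <-chan struct{}) {
	go func() {
		if serveErr := server.ListenAndServe(); serveErr != nil && !errors.Is(serveErr, http.ErrServerClosed) {
			fmt.Printf("metrics server error: %v\n", serveErr) // the manually added debug print
			select {
			case asyncErrorChannel <- serveErr: // never chosen if asyncErrorChannel is nil
			case <-shutdown:
			}
		}
	}()
}

func main() {
	// Occupy the port first so ListenAndServe fails, mimicking a second
	// collector instance fighting over localhost:8889.
	ln, err := net.Listen("tcp", "localhost:8889")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	errCh := make(chan error, 1)
	shutdown := make(chan struct{})
	serveMetrics(&http.Server{Addr: "localhost:8889"}, errCh, shutdown)
	fmt.Println("received async error:", <-errCh)
}
```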
Now if I rebuild the otelcol binary and run two instances of it again, I do see the "address already in use" error from one of the two instances.
I think the cause of this issue is this:

When OTel tries to report the metrics server error to the `asyncErrorChannel`, the channel is actually not set and is nil. I also manually added some print statements and verified that the `asyncErrorChannel` param passed in here is indeed nil. This basically causes the `case asyncErrorChannel <- serveErr:` to never be executed.

I think the channel is nil because here, when telemetry.Settings is set up, the `AsyncErrorChannel` is not set. We actually didn't have this problem when we were using an earlier version of OTel (e.g. v0.108.1), because it did set up the async error channel properly here.
This is quite problematic: if an OTel pipeline cannot bind to a port to report metrics, we have no proper way to know about it (other than noticing repeated failures once Prometheus actually starts scraping) so that we can retry on a different port.
What did you expect to see?
OTel should either report the metrics server error through the `asyncErrorChannel`, or return the error from `Start(ctx context.Context, host Host) error` so that it can be detected at startup.
What did you see instead?
The metrics server error is completely swallowed and there is not even a log about it.
What version did you use?
v0.111.0
What config did you use?
See the builder and collector configs in the steps to reproduce above.
Environment
OS: MacOS and Linux
Compiler: go 1.22
Additional context
n/a