
[exporter/prometheusremotewrite] Consider converting from pmetrics to prometheus data model in parallel. #21106

Closed
rapphil opened this issue Apr 21, 2023 · 8 comments

Comments

rapphil (Contributor) commented Apr 21, 2023

Component(s)

exporter/prometheusremotewrite

Is your feature request related to a problem? Please describe.

While performing load tests with the prometheusremotewrite exporter, I identified a bottleneck that could be further optimized.

I used a very simple collector configuration in the load tests:

extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  sigv4auth:
    region: "us-west-2"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['127.0.0.1:8888']

processors:
  batch:
    send_batch_size: 128000

exporters:
  logging:
    loglevel: debug
  awsxray:
    region: 'us-west-2'
  awsemf:
    region: 'us-west-2'
  prometheusremotewrite:
    endpoint: http://localhost:8080

  prometheusremotewrite/telemetry:
    endpoint: "https://<endpoint>"
    auth:
      authenticator: sigv4auth



service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    metrics/2:
      receivers: [prometheus]
      exporters: [prometheusremotewrite/telemetry]


  extensions: [pprof, sigv4auth]
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: 0.0.0.0:8888

The target of the prometheusremotewrite exporter is a dummy web server that only accepts the data and returns immediately.

We then ran a distributed load test using Locust to ingest data into the OTLP endpoint over OTLP/HTTP. We increased the load to the point where the prometheusremotewrite exporter could not keep up with the amount of data being ingested, and data started to accumulate in the queue.

I decided to profile the collector while it was under load, and I got to this:

[CPU profile screenshot]

After inspecting the code, I noticed that the conversion of data from pmetric to the Prometheus data model happens sequentially.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/exporter.go#L140
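
For reference, the hot path has roughly the following shape (a simplified sketch, not a copy of the linked code; exact signatures may differ between versions, and `export` here is just a stand-in for the exporter's real export/retry logic):

```go
package sketch

import (
	"context"

	"github.com/prometheus/prometheus/prompb"
	"go.opentelemetry.io/collector/pdata/pmetric"

	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite"
)

// pushMetrics mirrors the overall shape of PushMetrics: the whole batch is
// converted on the calling goroutine before anything is exported.
func pushMetrics(ctx context.Context, md pmetric.Metrics) error {
	// Single-threaded conversion of the entire pmetric.Metrics batch.
	tsMap, err := prometheusremotewrite.FromMetrics(md, prometheusremotewrite.Settings{})
	if err != nil {
		return err
	}
	return export(ctx, tsMap)
}

// export is a placeholder for the exporter's actual export path.
func export(_ context.Context, _ map[string]*prompb.TimeSeries) error { return nil }
```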

Describe the solution you'd like

I would like to propose parallelizing the conversion of metrics from pmetric to the Prometheus data model. The parallelism level would be controlled by a configurable parameter, and the algorithm would partition the data into chunks that are converted in parallel and then finally merged.
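
A minimal sketch of what that could look like, assuming hypothetical helper names (`convertChunk` stands in for the existing translator call, and the merge step simply appends samples when two chunks produce the same series key):

```go
package sketch

import (
	"sync"

	"github.com/prometheus/prometheus/prompb"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// convertChunk is a hypothetical stand-in for the existing pmetric -> Prometheus
// conversion (e.g. the translator's FromMetrics) applied to a subset of the batch.
func convertChunk(chunk pmetric.Metrics) map[string]*prompb.TimeSeries {
	return map[string]*prompb.TimeSeries{}
}

// parallelConvert partitions the batch by ResourceMetrics into `parallelism`
// chunks, converts the chunks concurrently, and merges the resulting maps.
func parallelConvert(md pmetric.Metrics, parallelism int) map[string]*prompb.TimeSeries {
	// Partition: round-robin the ResourceMetrics across the chunks.
	// (A real implementation would likely partition indices instead of copying.)
	chunks := make([]pmetric.Metrics, parallelism)
	for i := range chunks {
		chunks[i] = pmetric.NewMetrics()
	}
	rms := md.ResourceMetrics()
	for i := 0; i < rms.Len(); i++ {
		rms.At(i).CopyTo(chunks[i%parallelism].ResourceMetrics().AppendEmpty())
	}

	// Convert each chunk on its own goroutine.
	results := make([]map[string]*prompb.TimeSeries, parallelism)
	var wg sync.WaitGroup
	for i := range chunks {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = convertChunk(chunks[i])
		}(i)
	}
	wg.Wait()

	// Merge: append samples when two chunks produce the same series key.
	merged := make(map[string]*prompb.TimeSeries)
	for _, m := range results {
		for key, ts := range m {
			if existing, ok := merged[key]; ok {
				existing.Samples = append(existing.Samples, ts.Samples...)
			} else {
				merged[key] = ts
			}
		}
	}
	return merged
}
```

With the parallelism parameter set to 1 this degenerates to the current sequential behaviour, so the default could stay backwards compatible.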

Describe alternatives you've considered

One natural way of mitigating this issue is simply adding more collectors. However, this comes with its own set of problems and challenges. Ideally, each collector should scale to make the most of the hardware it is running on.

Additional context

No response

rapphil added the enhancement and needs triage labels on Apr 21, 2023
github-actions bot commented:

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

rapphil (Contributor, Author) commented Apr 21, 2023

After opening this ticket, I noticed that it could be considered a duplicate of #20741.

However, the approach suggested here is different.

atoulme (Contributor) commented Apr 29, 2023

Would you like to close this ticket and work with #20741?

atoulme removed the needs triage label on Apr 29, 2023
Aneurysm9 (Member) commented:

I think the two issues are related but possibly susceptible to independent resolutions. This issue can be resolved without any changes to the handling of consumer count on the queued retry helper by increasing data conversion parallelism. I believe that @rapphil also had some ideas for how to safely increase export parallelism that would be more closely aligned with #20741.

github-actions bot commented Jul 3, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jul 3, 2023
rapphil (Contributor, Author) commented Jul 3, 2023

This is still relevant

github-actions bot removed the Stale label on Jul 3, 2023
github-actions bot commented Sep 4, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Sep 4, 2023
github-actions bot commented Nov 3, 2023

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions bot closed this as not planned on Nov 3, 2023