Project Tracking: Performance Benchmarking SIG #1617
Comments
@puckpuck fyi |
Please delete boilerplate like this from the description to make it easier to read. |
There is more content like that which seems to be copied from a template; it should be deleted or replaced with more specifics. |
Does this refer to this document? |
cc @gsoria and @harshita19244, as they worked on performance benchmarks for SDKs at different stages (OpenTracing and OpenTelemetry) and can share their experience in doing so. |
@cartersocha I'd be happy to be the second GC sponsor supporting this Performance Benchmarking SIG. I recommend creating a charter doc for this SIG to map out more details about its mission, goals, deliverables, and logistics. Let's also itemize what is out of scope and list non-goals, since performance benchmarking is a subjective area for an open source project of OpenTelemetry's breadth and depth. Please share the link on this thread. |
Hi, I worked on a performance benchmarking project comparing the performance of the OpenTracing and OpenTelemetry libraries as part of my Outreachy internship. All tests were executed on bare metal machines. Please find the GitHub repo here: https://github.com/harshita19244/opentelemetry-java-benchmarks |
Over in the PHP SIG, we've implemented (most of) the documented perf tests, but what I think we lack is a way to run them on consistent hardware, and a way to publish the results (or compare against a baseline to track regressions/improvements). |
@brettmc already made an ask for bare metal machines that was approved. I'll share the details once we get them: cncf/cluster#245 |
Thx @cartersocha for starting this!
I would be super interested in participating. Recently @sh0rez started a project to compare the grafana-agent and Prometheus agent performance in collecting metrics. Since it's quite flexible, it wasn't too hard to extend it to include the OpenTelemetry Collector. Maybe it's beneficial for this project; happy to chat about it. |
Would love to see the data / results or hear about any testing done here @frzifus. Thanks for being willing to share your work 😎 |
Added a charter to the proposal as @alolita suggested. |
👍 |
Looking forward to seeing this go forward! cc @tobert |
hey @frzifus @sh0rez @harshita19244 @gsoria @brettmc we now have bare metal machines to run tests on. I wasn't sure how to add all of you on Slack, but we're in the CNCF Slack #otel-benchmarking channel. |
In Java we've taken performance fairly seriously, and we continue to make improvements as we receive feedback. For example, we received an issue about a use case in which millions of distinct metric series may need to be maintained in memory, along with feedback that the SDK at the time produced problematic memory churn. Since receiving it, we have reduced metric memory allocation by 80%, and there is work in progress to reduce it by 99.9% (essentially zero memory allocations after the metric SDK reaches a steady state). We also have performance test suites for many sensitive areas, and we validate that changes to those areas don't degrade performance.
All this is to say that I believe we have a decent performance story today. Where I think we could improve is in documentation of that performance to point curious users to. Our performance test suites require quite a bit of context to run and to interpret. It would be great if we could extend the spec performance benchmark document to include high-level descriptions of some use cases for each signal, and to provide tooling to run benchmarks and publish the results to some central location. If that were available, we would have some nice material to point users to when they are evaluating the project. We would still keep the nuanced performance tests around for sensitive areas, but it would be good to have something simpler and higher level.
In general, I think performance engineering is going to be very language- and implementation-dependent. I would caution against too expansive a scope for a cross-language performance group. It would be great to provide documentation of use cases to evaluate in suites, plus tooling for running on bare metal and publishing results, but there will always be nuanced language-specific concerns. I think we should raise those issues with the relevant SIGs and let their maintainers and contributors work out solutions. |
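As an aside for readers unfamiliar with JMH: a minimal sketch of the kind of suite described above might look like the following (illustrative only; the class name, metric name, and attributes are hypothetical and not taken from the opentelemetry-java repo).

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hypothetical benchmark: measures the hot path of recording a counter increment.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class CounterRecordBenchmark {

  private LongCounter counter;
  private Attributes attributes;

  @Setup
  public void setup() {
    // The real suites configure an SdkMeterProvider explicitly;
    // GlobalOpenTelemetry is used here only to keep the sketch short.
    counter = GlobalOpenTelemetry.getMeter("benchmark").counterBuilder("requests").build();
    attributes = Attributes.of(AttributeKey.stringKey("http.method"), "GET");
  }

  @Benchmark
  public void recordWithAttributes() {
    counter.add(1, attributes);
  }
}
```

Allocation-sensitive suites like the metrics work mentioned above would also be run under JMH's GC profiler (`-prof gc`) to track bytes allocated per operation. |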
I have a similar position to @jack-berg's. Taking OpenTelemetry .NET as an example, performance has been taken seriously from the beginning.
Thinking about what could potentially benefit OpenTelemetry .NET: having some perf numbers published in an official document on opentelemetry.io across all programming languages might increase discoverability. |
Thanks for the context, all. @jack-berg could you share where the Java tests are published and what compute they run on? @reyang could you share what compute you rely on in .NET, and would you consider migrating the test results to the OTel website, as the Collector does? |
The tests are scattered throughout the repo in directories next to the source they evaluate. All the directories contain "jmh". I wrote a quick little script to find them all:
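A minimal sketch of such a script, assuming the benchmark directories have "jmh" in their names (the original script isn't shown in this thread, so this reconstruction and its class name are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Walks a repo checkout and prints every directory whose name contains "jmh".
public class FindJmhDirs {
  public static void main(String[] args) throws IOException {
    Path root = Path.of(args.length > 0 ? args[0] : ".");
    try (Stream<Path> paths = Files.walk(root)) {
      paths
          .filter(Files::isDirectory)
          .filter(p -> p.getFileName().toString().contains("jmh"))
          .forEach(System.out::println);
    }
  }
}
```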
They run on each developer's local machine, and only on request. The basic idea is that maintainers / approvers know which areas of the code are sensitive and have JMH test suites. When someone opens a PR that we suspect has performance implications, we ask them to run the performance suite before and after and compare the results (example). It's obviously imperfect, but has generally been fine. It would be good if there were an easy way to run a subset of these on stable compute and publish the results to a central place. I think running / publishing all of them might be overwhelming. |
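As a sketch of how such a subset could be run and published: JMH's programmatic runner can select benchmarks by regex and emit machine-readable results. The include pattern, iteration counts, and output file below are hypothetical, not the Java SIG's actual configuration.

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunBenchmarkSubset {
  public static void main(String[] args) throws RunnerException {
    Options options =
        new OptionsBuilder()
            .include(".*CounterRecordBenchmark.*") // regex selecting the subset to run
            .forks(1)
            .warmupIterations(3)
            .measurementIterations(5)
            .resultFormat(ResultFormatType.JSON) // machine-readable output for publishing
            .result("benchmark-results.json")
            .build();
    new Runner(options).run();
  }
}
```

The JSON output could then be uploaded to whatever central location the group settles on. |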
Makes sense. Thanks for sharing those details. Let me start a thread in the CNCF Slack to coordinate machine access. |
A random find that I just stumbled across: a k6 extension for generating OTel signals, created by an ING Bank engineer: https://github.com/thmshmm/xk6-opentelemetry I'm not sure what the guidelines on usage of 3rd-party tooling are for the Performance Benchmarking SIG. |
Thanks for sharing @cwegener! The guidelines are still to be defined, so we'll see, but the general preference is for community tooling (which can also be donated). We're a decentralized project and each language has its quirks, so any guidelines we define would be more of a baseline. If you think this approach would be generally beneficial, we'd love to hear more. Feel free to cross-post in the #otel-benchmarking channel. |
I will test drive the k6 extension myself a little bit and report back in Slack. |
@cartersocha do you mind converting this issue to a PR? We are now placing proposals here: https://github.com/open-telemetry/community/tree/main/projects |
@cartersocha @jpkrohling @tylerbenson is this SIG currently ongoing, or should we close this / turn it into a project proposal PR until folks are ready to move forward? |
While the infrastructure is still in place and can continue to be used by individual SIGs, I don't think we got volunteers from anyone outside of Lightstep, and we've all moved on to other projects. I think the proposal can be closed. |
Description
As the adoption of OpenTelemetry grows and larger enterprises deepen their usage of project components, end users keep asking about OpenTelemetry's performance impact. Observed performance varies with the quirks of each environment, but without a project-wide performance standard and a historical record of results, no one really knows whether the numbers they see are abnormal or expected. Additionally, there is no comprehensive documentation on tuning project components or on the performance trade-offs available to users, which results in a reliance on vendor support.
Project maintainers need to be able to track the current state of their components and prevent performance regressions in new releases. Customers need a general sense of OpenTelemetry's potential performance impact, and assurance that the project takes performance and customer resources seriously. Performance tracking and quantification is a project-wide need, best addressed by a shared effort and automated tooling that minimizes repo-owner effort while providing valuable new data points for all project stakeholders.
Project Board
SIG Charter
charter
Deliverables
Initial implementation scope would be the core Collector components (main repo) and the JavaScript / Java / Python SDKs with their core components. No contrib or instrumentation repositories.
Staffing / Help Wanted
Anyone with an opinion on performance standards and testing.
Language maintainers or approvers as they will be tasked with implementing the changes and following through on the process.
Required staffing
Lead - TBD
@jpkrohling - domain expert
@cartersocha - contributor
@mwear - Collector SIG
@codeboten - Collector SIG implementation
@ocelotl - Python SIG
@martinkuba - JavaScript SIG
@tylerbenson - Java SIG
@sbaum1994 - contributor
@jpkrohling - TC/GC sponsor
@alolita - TC/GC sponsor
Need: more performance domain experts
Need: maintainers or approvers from several language SIGs to participate
Meeting Times
TBD
Timeline
Initial scope is for the Collector and three SDKs. Output should be delivered by KubeCon NA (November 6, 2023).
Labels
TBD
Linked Issues and PRs
https://opentelemetry.io/docs/collector/benchmarks/
cncf/cluster#245
cncf/cluster#182
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/performance-benchmark.md
https://opentelemetry.io/docs/specs/otel/performance-benchmark/