Capture Wire Protocol design goals #21
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# Design Goals for OpenTelemetry Wire Protocol | ||
|
||
We want to design a telemetry data exchange protocol that has the following characteristics: | ||
Is this intended to be a single protocol for different types of telemetry?
My past experience is with traces and logs (but not metrics). I think the answer should be yes, one protocol for all types, unless we think other telemetry types have significantly different requirements for the protocol. Thoughts are welcome.
I think this makes sense... there is a possible issue with the fact that metrics systems do not consume traces, and there are perhaps emerging standards for wire protocols in that space. How do we play well with Prometheus, etc.? Or is that out of scope? AKA we will always be doing some data conversion/processing in a gateway node between the telemetry network and the metrics backend.
||
|
||
- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders. | ||
|
||
- Have high reliability of data delivery and clear visibility when the data cannot be delivered. | ||
For "data cannot be delivered": is this the only type of error?
This is one of the problems that we encountered in production, particularly with the OpenCensus protocol. This requirement can be refined further if there is evidence of other common error types.
How do we think about de-duplicating the data? (e.g. in case of transmission failures, where the sender has no idea whether the backend accepted the data or not)
I'll have this in the RFC. In short, I suggest aiming for reliable delivery at the cost of possible duplicates. Details in the RFC.
Thank you, I'll look into the RFC.
||
|
||
- Have low CPU usage for serialization and deserialization. | ||
nit: I prefer
I think it is worth listing CPU and Memory separately. They are related but somewhat distinct with possibly different solutions. Especially the memory, I wanted to mention it in a separate paragraph so that we realize the specific pattern of allocating and using memory that some protocols tend to require can be quite unfriendly to GCs.
||
|
||
- Impose minimal pressure on the memory manager, including in pass-through scenarios where deserialized data is short-lived, must be serialized as-is shortly afterwards, and is created and discarded at high frequency (think telemetry data forwarders). | ||
|
||
- Support the ability to efficiently modify deserialized data and serialize it again to pass it on. This is related to, but slightly different from, the previous requirement. | ||
|
||
- Ensure high throughput (within the available bandwidth) in high-latency networks (e.g. scenarios where the telemetry source and the backend are separated by a high-latency network). | ||
|
||
- Allow backpressure signalling. | ||
|
||
- Be load-balancer friendly (do not hinder re-balancing). | ||
At what granularity are we load balancing? Per span?
Typically per batch of spans. L7 load balancers trigger rebalancing in specific situations. For gRPC we found that it typically happens when a gRPC stream is opened. By controlling what you include in one stream you can control the frequency of rebalancing.
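The point about controlling rebalancing frequency through stream lifetime can be illustrated with a minimal sketch. It assumes a sender that closes and reopens its stream after a fixed number of batches, so that an L7 load balancer gets a chance to pick a different backend on each reopen. The names `RotatingSender`, `open_stream`, and `max_batches` are hypothetical and not part of any proposed protocol or of gRPC itself; the stream object is any object with `send` and `close` methods.

```python
from typing import Any, Callable, List, Optional


class RotatingSender:
    """Sends batches and reopens the stream every `max_batches` batches.

    Hypothetical illustration: `open_stream` stands in for whatever call
    opens a new stream (for example, a new gRPC streaming call).
    """

    def __init__(self, open_stream: Callable[[], Any], max_batches: int) -> None:
        if max_batches < 1:
            raise ValueError("max_batches must be at least 1")
        self._open_stream = open_stream
        self._max_batches = max_batches
        self._stream: Optional[Any] = None
        self._sent_on_stream = 0

    def send(self, batch: List[str]) -> None:
        # Lazily open a stream; this is the moment a load balancer
        # can route the sender to a different backend.
        if self._stream is None:
            self._stream = self._open_stream()
            self._sent_on_stream = 0

        self._stream.send(batch)
        self._sent_on_stream += 1

        # Close the stream once it has carried its quota of batches,
        # so the next send opens a fresh one.
        if self._sent_on_stream >= self._max_batches:
            self._stream.close()
            self._stream = None
```

A smaller `max_batches` gives the load balancer more frequent opportunities to rebalance at the cost of more stream setup overhead; the right value depends on the deployment.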
Should backwards compatibility be part of the goals?
Do you mean backwards compatibility with any existing protocol? I think that should be a non-goal.
If you mean backwards compatibility within its own revisions then I think it goes without saying. That should be a goal of any protocol that can be deployed on a variety of devices at different times by different people. We can call this out specifically if needed.
I mean within its own versions. Agreed, it may already be implied and may not need to be called out specifically.
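The earlier thread on delivery guarantees ("reliable delivery at the cost of possible duplicates") can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design from the RFC: the sender retries until it sees an acknowledgement, and the receiver drops batches whose ID it has already stored. The batch ID, `Receiver`, `send_with_retry`, and `make_ack_dropping_transport` are all hypothetical names introduced for this example, and receiver-side de-duplication is just one possible answer to the question raised in review.

```python
from typing import Callable, List, Set


class Receiver:
    """Stores each batch once, even if it is delivered several times."""

    def __init__(self) -> None:
        self.stored: List[str] = []
        self.deliveries = 0
        self._seen_ids: Set[str] = set()

    def receive(self, batch_id: str, payload: str) -> bool:
        self.deliveries += 1
        if batch_id not in self._seen_ids:
            self._seen_ids.add(batch_id)
            self.stored.append(payload)
        return True  # acknowledgement


def send_with_retry(
    batch_id: str,
    payload: str,
    transport: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> bool:
    """Retries until acknowledged; returns False if delivery is unconfirmed."""
    for _ in range(max_attempts):
        try:
            if transport(batch_id, payload):
                return True
        except ConnectionError:
            # The sender cannot tell whether the data arrived, so it resends.
            pass
    return False  # the caller gets clear visibility of the failure


def make_ack_dropping_transport(
    receiver: Receiver, drop_first_n_acks: int
) -> Callable[[str, str], bool]:
    """Test double: delivers the data but loses the first N acknowledgements."""
    remaining = drop_first_n_acks

    def transport(batch_id: str, payload: str) -> bool:
        nonlocal remaining
        ack = receiver.receive(batch_id, payload)
        if remaining > 0:
            remaining -= 1
            raise ConnectionError("acknowledgement lost")
        return ack

    return transport
```

With one lost acknowledgement, the receiver sees the same batch twice but stores it once. If every acknowledgement is lost, `send_with_retry` returns `False` even though the data may have arrived, which is the ambiguity the reviewers discuss.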