Capture Wire Protocol design goals #21
This is a follow-up from the Spec SIG meeting on Jul 18, 2019, recapping the topics that I mentioned in the meeting. The document will help us design the right wire protocol.
- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders.
- Have high reliability of data delivery and clear visibility when the data cannot be delivered.
For "data cannot be delivered": is this the only type of error?
This is one of the problems that we encountered in production, particularly with the OpenCensus protocol. This requirement can be refined further if there is evidence of other common error types.
@@ -0,0 +1,19 @@
# Design Goals for OpenTelemetry Wire Protocol
Should backwards compatibility be part of the goals?
Do you mean backwards compatibility with any existing protocol? I think that should be a non-goal.
If you mean backwards compatibility within its own revisions then I think it goes without saying. That should be a goal of any protocol that can be deployed on a variety of devices at different times by different people. We can call this out specifically if needed.
I mean within its own versions. Agreed, it may already be implied and may not need to be called out specifically.
LGTM
- Have high reliability of data delivery and clear visibility when the data cannot be delivered.
- Have low CPU usage for serialization and deserialization.
nit: I prefer `resource` instead of `CPU` since we are at a high level.
I think it is worth listing CPU and memory separately. They are related but somewhat distinct, with possibly different solutions. I wanted to mention memory in a separate paragraph in particular, so that we recognize that the specific pattern of allocating and using memory that some protocols tend to require can be quite unfriendly to GCs.
@@ -0,0 +1,19 @@
# Design Goals for OpenTelemetry Wire Protocol

We want to design a telemetry data exchange protocol that has the following characteristics:
Is this intended to be a single protocol for different types of telemetry?
My past experience is with traces and logs (but not metrics). I think the answer should be yes, one protocol for all types, unless we think other telemetry types have significantly different requirements for the protocol. Thoughts are welcome.
I think this makes sense. One possible issue is that metrics systems do not consume traces, and there are perhaps emerging standards for wire protocols in that space. How do we play well with Prometheus, etc.? Or is that out of scope, i.e. we will always be doing some data conversion/processing in a gateway node between the telemetry network and the metrics backend?
- Allow backpressure signalling.
- Be load-balancer friendly (do not hinder re-balancing).
At what granularity are we load balancing? Per span?
Typically per batch of spans. L7 load balancers trigger rebalancing in specific situations. For gRPC we found that it typically happens when a gRPC stream is opened. By controlling what you include in one stream you can control the frequency of rebalancing.
Doing it per span is likely to be prohibitively expensive.
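To make the per-batch idea concrete, here is a minimal sketch of a sender that groups spans into batches and re-opens its stream after a fixed number of batches, giving the balancer a chance to pick a new backend. It does not use gRPC; `FakeBalancer`, `send_all`, and the `batches_per_stream` parameter are hypothetical names for illustration only, and the round-robin choice is an assumption about how a balancer might behave.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator, List, Tuple


def batches(spans: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split spans into fixed-size batches (the last one may be shorter)."""
    for start in range(0, len(spans), batch_size):
        yield spans[start:start + batch_size]


@dataclass
class FakeBalancer:
    """Hypothetical stand-in for an L7 load balancer.

    It chooses a backend only when a stream is opened (round-robin here),
    which mirrors the behaviour described above for gRPC.
    """
    backends: List[str]
    _next: Iterator[str] = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self._next = cycle(self.backends)

    def open_stream(self) -> str:
        return next(self._next)


def send_all(
    spans: List[str],
    balancer: FakeBalancer,
    batch_size: int,
    batches_per_stream: int,
) -> List[Tuple[str, List[str]]]:
    """Send batches, re-opening the stream every `batches_per_stream` batches.

    Returns (backend, batch) pairs showing where each batch landed.
    """
    placements = []
    backend = ""
    for i, batch in enumerate(batches(spans, batch_size)):
        if i % batches_per_stream == 0:
            # A new stream is the only point where rebalancing can happen.
            backend = balancer.open_stream()
        placements.append((backend, batch))
    return placements


spans = [f"span-{i}" for i in range(10)]
placements = send_all(spans, FakeBalancer(["a", "b"]), batch_size=2, batches_per_stream=2)
print([backend for backend, _ in placements])
```

A smaller `batches_per_stream` means more frequent rebalancing at the cost of more stream setup; setting `batch_size=1` and `batches_per_stream=1` corresponds to the per-span case that is called too expensive above.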
- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders.
- Have high reliability of data delivery and clear visibility when the data cannot be delivered.
How do we think about de-duplicating the data (e.g. in case of transmission failures, where the sender has no idea whether the backend accepted the data or not)?
I'll have this in the RFC. In short, I suggest aiming for reliable delivery at the cost of possible duplicates. Details in the RFC.
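As a rough illustration of that trade-off (not the RFC's actual mechanism, which is not described in this thread), the sketch below retries a batch until it is acknowledged. If an acknowledgement is lost, the receiver ends up with a duplicate. `Receiver`, `send_with_retry`, and the batch ID are hypothetical names assumed for the example.

```python
from typing import List, Set, Tuple


class Receiver:
    """Hypothetical backend that records every batch it accepts.

    `drop_ack_on_attempts` lists the attempt numbers (1-based) whose
    acknowledgement is lost on the way back to the sender.
    """

    def __init__(self, drop_ack_on_attempts: Set[int]) -> None:
        self.accepted: List[Tuple[int, str]] = []
        self._drop = drop_ack_on_attempts
        self._attempt = 0

    def receive(self, batch_id: int, payload: str) -> bool:
        self._attempt += 1
        # The data is stored even when the ack never reaches the sender.
        self.accepted.append((batch_id, payload))
        return self._attempt not in self._drop


def send_with_retry(receiver: Receiver, batch_id: int, payload: str,
                    max_attempts: int = 3) -> bool:
    """Resend until acknowledged; return False so the failure is visible."""
    for _ in range(max_attempts):
        if receiver.receive(batch_id, payload):
            return True
    return False


receiver = Receiver(drop_ack_on_attempts={1})
delivered = send_with_retry(receiver, batch_id=7, payload="spans")
print(delivered, receiver.accepted)

# If duplicates matter, a consumer could drop them by batch ID.
unique = list(dict.fromkeys(receiver.accepted))
print(unique)
```

The sender cannot distinguish a lost request from a lost acknowledgement, so retrying is the only way to be sure the data arrived; the duplicate is the price. The `False` return value is one simple way to surface the "data cannot be delivered" case to the caller.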
Thank you, I'll look into the RFC.
Reviewers, if there are no objections about these goals, let's merge this.
Reviewers, I need more comments or one more approval to merge this.
LGTM
Approvers, please merge.