Capture Wire Protocol design goals #21

Merged

tigrannajaryan
Member

This is a follow-up to the Spec SIG meeting on Jul 18, 2019, recapping
the topics that I mentioned in the meeting.

The document will help us design the right wire protocol.

- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders.

- Have high reliability of data delivery and clear visibility when the data cannot be delivered.

For "data cannot be delivered": is this the only type of error?

Member Author

This is one of the problems that we encountered in production, particularly with the OpenCensus protocol. This requirement can be refined further if there is evidence of other common error types.

@@ -0,0 +1,19 @@
# Design Goals for OpenTelemetry Wire Protocol
Member

Should backwards compatibility be part of the goals?

Member Author

Do you mean backwards compatibility with any existing protocol? I think that should be a non-goal.

If you mean backwards compatibility within its own revisions then I think it goes without saying. That should be a goal of any protocol that can be deployed on a variety of devices at different times by different people. We can call this out specifically if needed.

Member

I mean within its own versions. Agreed, it may already be implied and may not need to be pointed out specifically.

Member

@songy23 songy23 left a comment

LGTM

- Have high reliability of data delivery and clear visibility when the data cannot be delivered.

- Have low CPU usage for serialization and deserialization.

nit: I prefer "resource" instead of "CPU" since we are at a high level.

Member Author

@tigrannajaryan tigrannajaryan Jul 18, 2019

I think it is worth listing CPU and memory separately. They are related but somewhat distinct, with possibly different solutions. I wanted to mention memory in a separate paragraph in particular, so that we recognize that the specific pattern of allocating and using memory that some protocols tend to require can be quite unfriendly to GCs.

@@ -0,0 +1,19 @@
# Design Goals for OpenTelemetry Wire Protocol

We want to design a telemetry data exchange protocol that has the following characteristics:
Member

Is this intended to be a single protocol for different types of telemetry?

Member Author

My past experience is with traces and logs (but not metrics). I think the answer should be yes, one protocol for all types, unless we think other telemetry types have significantly different requirements for the protocol. Thoughts are welcome.

Contributor

I think this makes sense. One possible issue is that metrics systems do not consume traces, and there are perhaps emerging standards for wire protocols in that space. How do we play well with Prometheus, etc.? Or is that out of scope, meaning we will always be doing some data conversion/processing in a gateway node between the telemetry network and the metrics backend?


- Allow backpressure signalling.

- Be load-balancer friendly (do not hinder re-balancing).
Contributor

At what granularity are we load balancing? Per span?

Member Author

Typically per batch of spans. L7 load balancers trigger rebalancing in specific situations. For gRPC we found that it typically happens when a gRPC stream is opened. By controlling what you include in one stream, you can control the frequency of rebalancing.
Doing it per span is likely to be prohibitively expensive.
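
To illustrate the point about stream lifetime, here is a minimal simulation sketch. It is not taken from any real gRPC or load-balancer API: `batches`, `send_all`, the round-robin backend choice, and the `batches_per_stream` parameter are all hypothetical names and assumptions. It only shows that if the balancer can re-route solely when a stream is opened, the number of batches per stream sets how often rebalancing can happen.

```python
from itertools import cycle


def batches(spans, batch_size):
    """Yield consecutive slices of `spans`, each holding at most `batch_size` items."""
    for start in range(0, len(spans), batch_size):
        yield spans[start:start + batch_size]


def send_all(spans, backends, batch_size, batches_per_stream):
    """Simulate sending batches, opening a new stream every `batches_per_stream` batches.

    Assumption: the (hypothetical) load balancer picks a backend round-robin,
    and only at the moment a stream is opened.
    Returns a list of (backend, batch) pairs in send order.
    """
    pick_backend = cycle(backends)
    sent = []
    backend = None
    for i, batch in enumerate(batches(spans, batch_size)):
        if i % batches_per_stream == 0:
            # A new stream is opened here; this is the only re-routing opportunity.
            backend = next(pick_backend)
        sent.append((backend, batch))
    return sent
```

With `batches_per_stream=1` every batch is a chance to rebalance; larger values trade rebalancing frequency for fewer stream openings.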


- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders.

- Have high reliability of data delivery and clear visibility when the data cannot be delivered.
Member

How do we think about de-duplicating the data? (E.g., in case of transmission failures, where the sender has no idea whether the backend accepted the data or not.)

Member Author

I'll have this in the RFC. In short, I suggest aiming for reliable delivery at the cost of possible duplicates. Details in the RFC.
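
As a rough illustration of that trade-off (a sketch under stated assumptions, not the RFC's design), the snippet below simulates a sender that resends a batch until it sees an acknowledgement. `Receiver`, `send_at_least_once`, and the `ack_lost_on_attempts` parameter are hypothetical names invented for this example. When the data arrives but the acknowledgement is lost, the retry delivers the same batch twice, and when no acknowledgement ever arrives the failure is raised rather than hidden.

```python
class Receiver:
    """Hypothetical receiver that stores every batch it is handed."""

    def __init__(self):
        self.received = []

    def accept(self, batch_id, payload):
        self.received.append((batch_id, payload))


def send_at_least_once(receiver, batch_id, payload, ack_lost_on_attempts, max_attempts=5):
    """Resend until an acknowledgement arrives; return the number of attempts used.

    `ack_lost_on_attempts` is a set of 1-based attempt numbers on which the
    data reaches the receiver but the acknowledgement is lost (simulated).
    """
    for attempt in range(1, max_attempts + 1):
        receiver.accept(batch_id, payload)       # the data itself arrives
        if attempt not in ack_lost_on_attempts:  # the ack made it back to the sender
            return attempt
    # Make the delivery failure visible instead of silently dropping the batch.
    raise RuntimeError(f"batch {batch_id} not acknowledged after {max_attempts} attempts")
```

If duplicates matter to a backend, the `batch_id` in this sketch is the kind of key it could use to discard repeats.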

Member

Thank you, I'll look into the RFC.

@tigrannajaryan
Member Author

Reviewers, if there are no objections about these goals, let's merge this.

@tigrannajaryan
Member Author

Reviewers, I need more comments or one more approval to merge this.

Member

@reyang reyang left a comment

LGTM

@tigrannajaryan
Member Author

Approvers, please merge.

@yurishkuro yurishkuro merged commit 2ccb7cb into open-telemetry:master Jul 23, 2019
@tigrannajaryan tigrannajaryan deleted the feature/tigran/designgoals branch July 23, 2019 22:30
7 participants