Capture Wire Protocol design goals #21

Merged
19 changes: 19 additions & 0 deletions specification/proto-design-goals.md
@@ -0,0 +1,19 @@
# Design Goals for OpenTelemetry Wire Protocol
Member

Should backwards compatibility be part of the goals?

Member Author

Do you mean backwards compatibility with any existing protocol? I think that should be a non-goal.

If you mean backwards compatibility within its own revisions then I think it goes without saying. That should be a goal of any protocol that can be deployed on a variety of devices at different times by different people. We can call this out specifically if needed.

Member

I mean within its own versions. Agreed, it may already be implied and need not be called out specifically.


We want to design a telemetry data exchange protocol that has the following characteristics:
Member

Is this intended to be a single protocol for different types of telemetry?

Member Author

My past experience is with traces and logs (but not metrics). I think the answer should be yes, one protocol for all types, unless we think other telemetry types have significantly different requirements for the protocol. Thoughts are welcome.

Contributor

I think this makes sense, but there is a possible issue: metrics systems do not consume traces, and there are perhaps emerging standards for wire protocols in that space. How do we play well with Prometheus, etc.? Or is that out of scope, meaning we will always be doing some data conversion/processing in a gateway node between the telemetry network and the metrics backend?


- Be suitable for use between all of the following node types: instrumented applications, telemetry backends, local agents, stand-alone collectors/forwarders.

- Have high reliability of data delivery and clear visibility when the data cannot be delivered.

For "data cannot be delivered": is this the only type of error?

Member Author

This is one of the problems that we encountered in production, particularly with the OpenCensus protocol. This requirement can be refined further if there is evidence of other common error types.

Member

How do we think about de-duplicating the data (e.g. in case of transmission failures, where the sender has no idea whether the backend accepted the data or not)?

Member Author

I'll cover this in the RFC. In short, I suggest aiming for reliable delivery at the cost of possible duplicates. Details will be in the RFC.

Member

Thank you, I'll look into the RFC.
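To make the trade-off in this thread concrete, here is a minimal sketch of "reliable delivery at the cost of possible duplicates". It is not taken from the RFC, which is not part of this page. The `FlakyBackend` class, the `batch_id` field, and the receiver-side `deduplicate` helper are all hypothetical names introduced only for illustration.

```python
import itertools


class FlakyBackend:
    """Hypothetical backend that stores a batch but sometimes loses the ack."""

    def __init__(self, lose_ack_on_attempts):
        self.received = []  # every batch that arrived, duplicates included
        self._lose_ack_on = set(lose_ack_on_attempts)
        self._attempt = itertools.count(1)

    def export(self, batch_id, spans):
        attempt = next(self._attempt)
        self.received.append((batch_id, tuple(spans)))
        if attempt in self._lose_ack_on:
            # The data was accepted, but the sender never learns about it.
            raise TimeoutError("ack lost")
        return "ok"


def send_at_least_once(backend, batch_id, spans, max_attempts=5):
    """Retry until an ack is received; return the number of attempts made."""
    for attempt in range(1, max_attempts + 1):
        try:
            backend.export(batch_id, spans)
            return attempt
        except TimeoutError:
            # The sender cannot tell whether the data arrived, so it resends.
            continue
    # Clear visibility when the data cannot be delivered.
    raise RuntimeError(f"batch {batch_id} undelivered after {max_attempts} attempts")


def deduplicate(received):
    """Receiver-side option (an assumption): keep the first copy of each batch ID."""
    seen = {}
    for batch_id, spans in received:
        seen.setdefault(batch_id, spans)
    return seen


backend = FlakyBackend(lose_ack_on_attempts=[1])
attempts = send_at_least_once(backend, batch_id="b1", spans=["span-a", "span-b"])
unique = deduplicate(backend.received)
```

In this run the first ack is lost, so the sender resends and the backend ends up holding two copies of the same batch. Whether duplicates are then dropped by the receiver, as in the hypothetical `deduplicate`, or simply tolerated is a design choice left to the RFC.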


- Have low CPU usage for serialization and deserialization.

nit: I prefer "resource" instead of "CPU" since we are at a high level.

Member Author (@tigrannajaryan, Jul 18, 2019)

I think it is worth listing CPU and memory separately. They are related but somewhat distinct, with possibly different solutions. I wanted to mention memory in a separate paragraph in particular, so that we recognize that the specific pattern of allocating and using memory that some protocols tend to require can be quite unfriendly to GCs.


- Impose minimal pressure on the memory manager, including in pass-through scenarios, where deserialized data is short-lived and must be serialized as-is shortly after, and where such short-lived data is created and discarded at high frequency (think telemetry data forwarders).

- Support the ability to efficiently modify deserialized data and serialize it again to pass it further. This is related to, but slightly different from, the previous requirement.

- Ensure high throughput (within the available bandwidth) in high-latency networks (e.g. scenarios where the telemetry source and the backend are separated by a high-latency network).

- Allow backpressure signalling.

- Be load-balancer friendly (do not hinder re-balancing).
Contributor

At what granularity are we load balancing? Per span?

Member Author

Typically per batch of spans. L7 load balancers trigger rebalancing in specific situations. For gRPC we found that it typically happens when a gRPC stream is opened. By controlling what you include in one stream, you can control the frequency of rebalancing.
Doing it per span is likely to be prohibitively expensive.
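
The point about controlling rebalancing frequency through stream contents can be sketched as a small simulation. This is an illustrative model only, not gRPC code: `RoundRobinBalancer`, `send_batches`, and the `batches_per_stream` parameter are hypothetical names, and the balancer is assumed to pick a backend only when a stream is opened.

```python
import itertools


class RoundRobinBalancer:
    """Stand-in for an L7 load balancer that picks a backend when a stream opens."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def open_stream(self):
        return next(self._cycle)


def send_batches(balancer, batches, batches_per_stream):
    """Send batches, reopening the stream every `batches_per_stream` batches.

    Returns a list of (backend, batch) pairs showing where each batch landed.
    """
    if batches_per_stream < 1:
        raise ValueError("batches_per_stream must be at least 1")
    placements = []
    backend = None
    for index, batch in enumerate(batches):
        if index % batches_per_stream == 0:
            # Opening a new stream is the rebalancing opportunity.
            backend = balancer.open_stream()
        placements.append((backend, batch))
    return placements


batches = [["s1", "s2"], ["s3"], ["s4", "s5"], ["s6"]]
placements = send_batches(RoundRobinBalancer(["A", "B"]), batches, batches_per_stream=2)
```

With `batches_per_stream=2`, the first two batches go to backend `A` and the next two to `B`. A smaller value gives the balancer more chances to redistribute load at the cost of more stream setups, while a very large value pins all traffic to one backend.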