8 changes: 8 additions & 0 deletions api/base.proto
@@ -227,3 +227,11 @@ message TransportSocket {
// See the supported transport socket implementations for further documentation.
google.protobuf.Struct config = 2;
}

// Percent, typically used to specify things like target sampling percentages among tracing requests
// (as in, e.g., :ref:`HTTP Connection Manager tracing
// <envoy_api_field_filter.network.HttpConnectionManager.tracing>`).
message Percent {

Member

Can you add a simple comment for docs purposes (e.g., explaining the purpose and the valid range)?

Author

Yep, fixed.

// The percent, a float between 0 and 1.
float value = 1 [(validate.rules).float = {gte: 0, lte: 1}];

Member

I would also say that I don't feel very strongly about 0.0-1.0 if people prefer 0.0-100.0.

Author

I prefer ~0.0-~1.0, which I think most people would say is more intuitive.

I would consider replacing 1.0 with some other number (e.g., 100) if we were planning on users needing that extra resolution between numbers. I do not get the sense they will, though.

Contributor

If the message is called percent, the value range should be [0, 100]. That is how most people interpret English words.

Member

SGTM, I think per above we should switch to RoundedPercent, double, 0.0-100.0

}
18 changes: 18 additions & 0 deletions api/filter/network/http_connection_manager.proto
@@ -80,6 +80,24 @@ message HttpConnectionManager {
// populate the tag name, and the header value is used to populate the tag value. The tag is
// created if the specified header name is present in the request's headers.
repeated string request_headers_for_tags = 2;

// Global target percentage of requests that will be force traced if the *x-client-trace-id*
// header is set. Percent is resolved to the nearest 1% (rounded down). Defaults to 1 (i.e.,

Member

For new things in the v2 API, I think we have the chance to normalize the gradients which will be much less confusing to people. I would recommend doing the following:

  1. Documenting on the Percent message itself that the float value will be normalized to 0.01% increments. (Internally we will multiply by 10,000).
  2. Then we can remove the rounding documentation from each of these such that it's inferred that everything is 0.01% increments.

For legacy code in which a Percent object is not defined, we can use the legacy runtime dividend. For code in which Percent is defined, we can use 10,000 as the dividend. I would recommend we also deprecate the old dividends that are not 100 in runtime and then just delete them in the next release with a release note.

I realize this is more work, but it will make the future situation much easier to understand.

How does this sound?
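
The normalization proposed above can be sketched as follows. This is a minimal illustration, not Envoy's implementation: `DIVIDEND` and `to_numerator` are hypothetical names, and the input is assumed to be the `[0, 1]` fraction validated on the `Percent` message in this diff.

```python
import math

# Hypothetical constant: 10,000 steps means one step is 0.01%.
DIVIDEND = 10_000


def to_numerator(value: float) -> int:
    """Map a fraction in [0, 1] to a whole number of 0.01% steps, rounding down."""
    # NaN fails this comparison too, so it is rejected along with out-of-range input.
    if not (0.0 <= value <= 1.0):
        raise ValueError(f"percent value out of range: {value!r}")
    # Caveat: floor() on a binary float product can land one step low
    # for some decimal inputs, so real code may want a more careful rounding rule.
    return math.floor(value * DIVIDEND)


print(to_numerator(0.5))      # 5000, i.e. 50.00%
print(to_numerator(0.00019))  # 1, i.e. 0.019% is rounded down to 0.01%
```

With a single dividend, every `Percent` field resolves the same way, so the per-field rounding notes become unnecessary.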

@hausdorff, Dec 31, 2017

Author

(I'm pretty easy-going about landing the patches correctly even if it's slower.)

I am ok with having all Percent objects resolve at 0.01% increments as long as @htuch agrees. But while it's up in the air, I'm a bit confused about the semantics.

What's the reason we're using an integral type instead of a float-ish type to represent stuff like sampling percentage targets? Is it just to force people to think in target percentages that are perceptibly different over a few thousand requests? If so then representing it as a float just seems like a weird tool, since we're letting people express things we won't execute on. (EDIT: Especially since it's possible to have something weird like 1.0000000001 or something, which I suspect would fail the bounds check above.)

And if not, then I'm not sure an arbitrary truncation horizon is clearly better than just letting good old IEEE 754 "solve" the problem incidentally by the way it specifies float semantics.
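
Whether a value like 1.0000000001 trips the bounds check depends on where it gets narrowed to a 32-bit float. The sketch below only demonstrates the IEEE 754 behavior; it makes no claim about the order in which a config loader or the validation rules actually process the value. `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 1.0000000001

# Compared as a double, the value is strictly greater than 1.
print(value <= 1.0)              # False

# Narrowed to float32 first, the excess is below float32 resolution near 1
# and rounds away, so the same check passes.
print(as_float32(value) <= 1.0)  # True
```

So the outcome of an `lte: 1` rule on a `float` field hinges on whether the comparison happens before or after narrowing.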

Member

There is no good historical reason that the config didn't always use floats. It's just the way it was done. Internally in the code, dealing with integers is substantially faster when computing randomness and random chance hits, so we want to be able to convert any float config into integers for computation.

I actually don't care if we want to make Percent very clearly an integer between 0 and 10,000 and explain that it will be converted to 0.01% increments internally. I just figured most users would prefer to work with a float. I'm fine either way.

@wora might have some thoughts on this as well. I would prefer to wait for him to chime in before we merge this so it will likely need to wait till early next week.

Member

P.S., the reason 10,000 is used is that a random result can be inferred from 2 bytes of entropy, which we make use of in a few places (UUID stable sampling comes to mind).
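
A rough sketch of the 2-byte idea, assuming the request ID is a textual UUID and that the first four hex characters serve as the entropy. `stable_bucket` and `is_sampled` are hypothetical names, not Envoy functions, and the actual code may pick the bytes differently.

```python
DIVIDEND = 10_000


def stable_bucket(request_id: str) -> int:
    """Derive a bucket in [0, 10000) from the first 2 bytes (4 hex chars) of a UUID."""
    two_bytes = int(request_id[:4], 16)  # 0..65535
    # 65536 is not a multiple of 10,000, so buckets 0..5535 are slightly
    # more likely than the rest. This sketch ignores that bias.
    return two_bytes % DIVIDEND


def is_sampled(request_id: str, numerator: int) -> bool:
    """The same request ID always gives the same answer for a given numerator."""
    return stable_bucket(request_id) < numerator


request_id = "ffff0000-0000-4000-8000-000000000000"
print(stable_bucket(request_id))     # 5535
print(is_sampled(request_id, 5000))  # False
print(is_sampled(request_id, 6000))  # True
```

Because the decision is a pure function of the ID, every hop that sees the same request ID reaches the same sampling decision without coordination.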

Author

Agreed. There should be no rush on API changes!

Contributor

In general, we should avoid using float unless data size is critical, as with colors. JSON cannot represent float, so you end up with float<->double conversions everywhere, which cause data loss.

Envoy is basic infrastructure, so I don't feel Envoy should enforce a minimum ratio of 0.01%. Why is this something Envoy needs to decide? Can we leave the problem to the operators?

If we do enforce a 0.01% ratio, then we should change Percent to use integers, so we don't need to explain the rounding problem, because integers force rounding anyway.
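
The float<->double concern can be shown with a short round trip. This is only an illustration, assuming the JSON layer parses numbers as doubles and the proto field stores a 32-bit float; `as_float32` is a hypothetical helper.

```python
import struct


def as_float32(x: float) -> float:
    """Store a double in a 32-bit float field, then read it back as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


configured = 0.1                 # what the operator wrote in JSON
stored = as_float32(configured)  # what a float field would hold

print(stored)                # 0.10000000149011612
print(stored == configured)  # False: the round trip changed the value
```

With a `double` field the stored value equals the parsed JSON number, so this class of mismatch does not arise.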

Member

How about changing Percent to RoundedPercent so it's more clear to the user? We can also switch to double and make the range 0.0-100.0 per @wora. Then I think there will be no confusion that we are going to round internally for performance reasons.

Author

We don't need, and will likely never need, the extra resolution, but it's also not a big deal to just use double. Let's go with double.

// 100%). This variable is a direct analog for the variable of the same name in the :ref:`HTTP
// Connection Manager <config_http_conn_man_runtime>`.
Percent client_enabled = 3;

Contributor

The field name is very vague. It is more like "client_enabled_sampling".

Member

Agreed. This was an unfortunate internal name choice. Let's fix this in the config. I would probably just do client_sampling personally.

Author

Fixed.


// Global target percentage of requests that will be traced after all other checks have been

Contributor

The concept of "after all other checks" is questionable here. In a distributed system, we can not assume there is a fixed order of checks. It doesn't seem proper to assume tracing has the privilege of being the last decider.

Author

My understanding is that this is per-HTTP connection manager, which is not distributed, so I think we should just get rid of the "global" and call that out specifically.

// applied (force tracing, sampling, etc.). Percent is resolved to the nearest 1% (rounded
// down). Defaults to 1 (i.e., 100%). This variable is a direct analog for the variable of the

Contributor

My personal lesson from infrastructure design is that everything should default to 0. Don't assume everyone wants my feature, no matter how much I love it myself.

If I add a new feature to a system, I cannot touch all existing customers, so the new feature must be off by default. Instead of arguing about whether a new feature should be on by default, we should have a hard policy that everything is off by default, and not spend time debating it. We never know what users want to do.

Member

Although I agree that a default of 0 would be better, that's not how the current code works and we can't change it. C'est la vie.

// same name in the :ref:`HTTP Connection Manager <config_http_conn_man_runtime>`.
Percent global_enabled = 4;

Contributor

We should not design an API based on another API unless the concept is well known, such as country_code. From reading this comment, I don't understand what "global enabled" means, either as a developer or as an operator.

Member

Yup let's clean this up. Let's just call this sampled to go with client_sampled above.

Author

Fixed.


// Global target percentage of requests that will be randomly traced. Percent is resolved to the

Contributor

The concept of Global is strange here. If I run a server, that server can be a standalone server, or a server in a cluster, or a zone, or a region, or many regions. What does "Global" mean here? Does my server have to talk to a global service to do the tracing?

Author

Nope, it's per HTTP connection manager (as I understand it). Fixed.

// nearest 0.01%, rounded down. Defaults to 1 (i.e., 100%). This variable is a direct analog for

Contributor

We really should get out of the rounding business. If I am running a memcache service with millions of requests per second, why do I have to follow such a policy?

Member

The API docs are telling you what Envoy itself is going to do. We need to explain to users of the API what effect setting a specific value will have, so somewhere we need to be clear about how fine-grained percentages/ratios will be interpreted in practice.

// the variable of the same name in the :ref:`HTTP Connection Manager
// <config_http_conn_man_runtime>`.
Percent random_sampling = 5;
}

// Presence of the object defines whether the connection manager