Skip to content

Add initial Telemetry API with Tracing support for sampling and custom tags#1740

Merged
istio-testing merged 15 commits intoistio:masterfrom
douglas-reid:telemetry-api-part-one
Mar 26, 2021
Merged

Add initial Telemetry API with Tracing support for sampling and custom tags#1740
istio-testing merged 15 commits intoistio:masterfrom
douglas-reid:telemetry-api-part-one

Conversation

@douglas-reid
Copy link
Copy Markdown
Contributor

@douglas-reid douglas-reid commented Nov 10, 2020

This PR adds support for new Telemetry resource. In particular, it adds Tracing rules to enable workload-specific customization of tracing behavior. This is based on design discussions from Proposal: Telemetry API document (and the RFCs that preceded it).

The purpose of the new Telemetry resource is to enable a number of use cases around telemetry generation customization through a stable, supportable API that is much more operator-friendly than the current EnvoyFilter-based approaches in use today. Fuller justification of the PR (and the issues that is attempting to address) can be found in the proposal.

The Telemetry resource will hold provider-agnostic telemetry override configuration for workloads. It is meant to be used by Application Operators to customize telemetry generation for their workloads. In this PR, the Telemetry resource only configures tracing behavior to limit the overall scope and discussion while still providing insight into how the two additions will work together. Logging and metrics support will be quick-follows upon approval of this PR.

As the new resources mature, and full support is added to the control plane, subsequent PRs will be required to un-hide these changes in the docs and to label the various bits of MeshConfig and ProxyConfig that are being replaced as deprecated.

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Nov 10, 2020
@istio-policy-bot
Copy link
Copy Markdown

😊 Welcome @douglas-reid! This is either your first contribution to the Istio api repo, or it's been
awhile since you've been here.

You can learn more about the Istio working groups, code of conduct, and contributing guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Nov 10, 2020
@istio-testing
Copy link
Copy Markdown
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 10, 2020
@douglas-reid
Copy link
Copy Markdown
Contributor Author

/test all

Copy link
Copy Markdown
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall 👍 for me

// $hide_from_docs
// Next available field number: 58
message TelemetryProvider {
// REQUIRED. A unique name identifying the telemetry provider.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name must be unique. The name is used to refer to a provider from the telemetry api, so it should also be meaningful.

@mandarjog
Copy link
Copy Markdown
Contributor

Ping! Folks we need to keep this moving.

@douglas-reid
Copy link
Copy Markdown
Contributor Author

Ping! Folks we need to keep this moving.

I'm working on addressing initial comments. Will push an update today.

//
// For resources with a workload-selector, it is only valid to have one resource selecting
// any given workload. If multiple resources select a single resource, the oldest known
// resource will be used to the exclusion of all other resources.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be 'the most specific' ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is addressing the issue of multiple resources at the same level - there would be no 'most specific'. I'll update the text.

@nrjpoddar
Copy link
Copy Markdown
Member

@douglas-reid are you going to scope this PR down or hide APIs so only the Tracing APIs are exposed? I know we want to target tracing for 1.9 so I'm asking this question.

@mandarjog
Copy link
Copy Markdown
Contributor

We should hide the whole thing, and unhide parts as the impl is checked in.

@douglas-reid
Copy link
Copy Markdown
Contributor Author

@douglas-reid are you going to scope this PR down or hide APIs so only the Tracing APIs are exposed? I know we want to target tracing for 1.9 so I'm asking this question.

Right now, other than the providers bits, this is entirely Tracing oriented (with commented out placeholders for metrics/logs). Should I remove the non-tracing related provider config too?

@nrjpoddar
Copy link
Copy Markdown
Member

@douglas-reid are you going to scope this PR down or hide APIs so only the Tracing APIs are exposed? I know we want to target tracing for 1.9 so I'm asking this question.

Right now, other than the providers bits, this is entirely Tracing oriented (with commented out placeholders for metrics/logs). Should I remove the non-tracing related provider config too?

I'm fine with Mandar's proposal too as long as we only unhide APIs that are implemented. Keeping a lot of unimplemented APIs spanning multiple releases can be a bit of nuisance.

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Nov 20, 2020 via email

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Nov 20, 2020 via email

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Nov 20, 2020 via email

@costinm
Copy link
Copy Markdown
Contributor

costinm commented Nov 20, 2020 via email

Copy link
Copy Markdown
Contributor

@smawson smawson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came here to hit the big green approve button but I see Louis has some lingering comments so I'll let him hit the button :)

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dougreid this is looking good!

One overall comment:

All the providers seem to have few common fields related to address/port which are copied everywhere. Should we think about creating a type for it?

May be this was explored in conversations above so just me to it instead of rehashing.

@howardjohn
Copy link
Copy Markdown
Member

All the providers seem to have few common fields related to address/port which are copied everywhere. Should we think about creating a type for it?

Proto doesn't have a way to have "embedded" types like golang, so this means that we would need it nested which I don't think adds too much value. While it looks weird in the proto, a user doesn't see that - they just see that all providers have a "address" field which makes intuitive sense. So personally I think the way we have it now makes sense.

@nrjpoddar
Copy link
Copy Markdown
Member

nrjpoddar commented Mar 19, 2021

All the providers seem to have few common fields related to address/port which are copied everywhere. Should we think about creating a type for it?

Proto doesn't have a way to have "embedded" types like golang, so this means that we would need it nested which I don't think adds too much value. While it looks weird in the proto, a user doesn't see that - they just see that all providers have a "address" field which makes intuitive sense. So personally I think the way we have it now makes sense.

I understand Proto limitations. Since these fields collectively describe the transport/cluster settings clubbing them together under a type does not feel odd to me. I just don’t think it’s good API practice to keep fields across growing list of messages consistent by copying/pasting.

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@douglas-reid is max length supported by all currently listed providers?

// same namespace as the Telemetry policy.
istio.type.v1beta1.WorkloadSelector selector = 1;

// Optional. Tracing defines the per-workload overrides for trace span
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot be optional I’m assuming?

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using per-workload here might be misguided as you can use it to refer to all workloads in namespace or mesh if selectors are omitted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add back in the "Optional" bit later, if that helps. This is anticipating adding in Metrics and AccessLogging (in which case you will not be required to add Tracing configuration).

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it but I would prefer to make comments mimic the current implementation and not the future.

// will be used.
// NOTE: At the moment, only a single provider can be specified in a given
// Tracing rule.
repeated ProviderRef providers = 2;
Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What’s the behavior if this resource is created in istio-system or root config namespace with no workload selectors and configures a provider? Does that override the MeshConfig setting?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only a prob during transition phase when both exist? I assume the config in mesh config will be eventually removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nrjpoddar it overrides the meshconfig setting for all matches that "fallback" to root config namespace (a result of the full replacement semantics). since the most specific match "wins", we have the following scenarios:

  • mesh config default + root config ns resource with provider = provider from root config ns resource is used in the mesh
  • mesh config default + root config ns resource with provider + workload-specific resource without provider = provider from the mesh config default is used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One option is to not allow provider reference in the root config namespace.

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current UX

  • mesh config default + root config ns resource with provider = provider from root config ns resource is used in the mesh
  • mesh config default + root config ns resource with provider + workload-specific resource without provider = provider from the mesh config default is used.

is quite confusing I think. @louiscryan @smawson?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely confusing -- and could use refinement.

Other potential issues (more important for metrics/access logging): enabling multiple providers but requiring different configuration (example: send byte distribution metrics to prometheus but not to cloud monitoring). With most specific match winning, there isn't a way to support that. We could make providers a secondary part of the match condition (allowing multiple rules at the same specificity level) - at the cost of more complexity.

Tracing is fundamentally different than metrics/access-logs in this regard (currently). You can't configure two different tracing providers for a single HCM. But we could have multiple metrics-generating filters and access log is a repeated field in HCM.

One way forward may be to separate provider-scoped configuration (say some custom tags are only sent to one tracing backend for this workload) from provider selection. Even still, we'd have to figure out how to handle inheritance of selection.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should favor replace-only options, it means if you specify a list of provider, you will replace the stuff above you. If you don't specify then you use walk up the chain. We need to define what are those indivisible messages that are replaced.
Replace semantics on top level Telemetry message results in a confusing UX, so we need to lower it a level.
In Tracing for example, providers and everything else seems like a good division.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated documentation based on conversation recorded in: https://docs.google.com/document/d/1o3rH_C9sWM9Y2xAFfkvlHXEnF6vBR3OfkklHkwjwTAc/edit?resourcekey=0-4e-YSD_7fwtCqQuNnhiNEQ#heading=h.xw1gqgyqs5b

I don't believe that the decision there necessitates any API changes, but rather implementation-side merging.

// same namespace as the Telemetry policy.
istio.type.v1beta1.WorkloadSelector selector = 1;

// Optional. Tracing defines the per-workload overrides for trace span
Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using per-workload here might be misguided as you can use it to refer to all workloads in namespace or mesh if selectors are omitted.

//
// Telemetry configuration does not inherit. Rather, matched configurations must fully-specify
// the desired behavior. Workload-specific configuration fully replaces any configuration
// defined at the local and root namespace levels.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended going forward or just a limitation atm?

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar Mar 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With DR inheritance and VS referencing we now have a mixed bag of overrides, inheritance and references. @louiscryan @smawson @linsun we should reconcile when one make sense over the other.

Not a blocker for this API though.

// spec:
// selector:
// labels:
// service.istio.io/canonical-name: foo
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be really good to explain if the foo service should also have this label or just the app: foo label?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this label is auto-added for all workloads in istio (without requiring user action). i'd like to avoid trying to document that functionality in example configuration, as it is just an example of a label, if that's ok.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a workload bound configuration. It is explicitly not bound to a service. And “canonical service” gives it exclusivity.

Comment on lines +34 to +35
// For any namespace, including the root configuration namespace, it is only valid
// to have a single workload selector-less Telemetry resource.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to make sure I understand this, are you saying if I apply foo-tracing in bar ns and foo-tracing-alternative also in bar ns (your examples below), it is not allowed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. The examples below aren't meant to be used at the same time, but I can change the ns in one if it helps.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated ns in one example.

@nrjpoddar
Copy link
Copy Markdown
Member

Overall I think with the addition of default provider in MeshConfig we have made a mix of inheritance and override semantics. I’m worried about the case when a Resource defines a provider in root namespace which is different from MeshConfig default.

@douglas-reid
Copy link
Copy Markdown
Contributor Author

@douglas-reid
Copy link
Copy Markdown
Contributor Author

All the providers seem to have few common fields related to address/port which are copied everywhere. Should we think about creating a type for it?

Proto doesn't have a way to have "embedded" types like golang, so this means that we would need it nested which I don't think adds too much value. While it looks weird in the proto, a user doesn't see that - they just see that all providers have a "address" field which makes intuitive sense. So personally I think the way we have it now makes sense.

I understand Proto limitations. Since these fields collectively describe the transport/cluster settings clubbing them together under a type does not feel odd to me. I just don’t think it’s good API practice to keep fields across growing list of messages consistent by copying/pasting.

I think there are plenty of ways to refactor the providers configuration. Perhaps in the follow-on move to separate providers into their own proto files, we can explore ways to expose the common fields? I'm also worried about duplicating fields all over the place.

@nrjpoddar
Copy link
Copy Markdown
Member

Adding do-no-merge label to prevent accidental merge until we figure out "defaulting" semantics.

@nrjpoddar nrjpoddar added the do-not-merge/hold Block automatic merging of a PR. label Mar 19, 2021
@douglas-reid
Copy link
Copy Markdown
Contributor Author

Adding do-no-merge label to prevent accidental merge until we figure out "defaulting" semantics.

@nrjpoddar do we think this is ready now?

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks for working through the various suggestions and feedback. I’m still not convinced how you will refactor provider fields to avoid dull location without requiring a migration for users. But I’m fine merging this if you think there’s a way to scope the duplicated fields and manage it better long term.

Copy link
Copy Markdown
Member

@nrjpoddar nrjpoddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m guessing you will add a separate PR which adds comments to explain merging semantics.

@nrjpoddar nrjpoddar removed the do-not-merge/hold Block automatic merging of a PR. label Mar 26, 2021
@istio-testing istio-testing merged commit 68cba41 into istio:master Mar 26, 2021
zhlsunshine pushed a commit to zhlistio/api that referenced this pull request Jul 8, 2021
…m tags (istio#1740)

* Add initial Telemetry API definition

* fix names and comments

* Rename TracingConfig to Tracing

* Remove TelemetryProviders and consolidate to just Tracing providers

* add new extension providers to oneof

* Remove providers from telemetry API

* Add release note

* Address comments

* Add back providers, remove match, simplify Trace API

* Collapse TracingRule into Tracing and remove deprecations in ProxyConfig

* Move from address to service + port in providers

* Remove exclude_mesh_tags

* Revert to boolean control of span reporting

* Cleanup documentation

* Replace subdomain with telemetry type
zhlsunshine pushed a commit to zhlistio/api that referenced this pull request Jul 30, 2021
…m tags (istio#1740)

* Add initial Telemetry API definition

* fix names and comments

* Rename TracingConfig to Tracing

* Remove TelemetryProviders and consolidate to just Tracing providers

* add new extension providers to oneof

* Remove providers from telemetry API

* Add release note

* Address comments

* Add back providers, remove match, simplify Trace API

* Collapse TracingRule into Tracing and remove deprecations in ProxyConfig

* Move from address to service + port in providers

* Remove exclude_mesh_tags

* Revert to boolean control of span reporting

* Cleanup documentation

* Replace subdomain with telemetry type
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. lifecycle/automatically-closed Indicates a PR or issue that has been closed automatically. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.