Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTEP: Define ResourceProvider #4316
base: main
Are you sure you want to change the base?
OTEP: Define ResourceProvider #4316
Changes from 2 commits
a830655
c8d5301
a3cb621
2fd655c
a1d4461
844ffa8
cc9877f
3cd7641
7e24b05
6e1d24d
3d01a5f
47d549e
b134f1b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. So, if you read the document top down and have no idea what this is about: Until one line above, this was a high-level description of why the current resource handling is not really fit for reality. And now you introduce a function. I mean where? The step you took here is far too far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, regarding the details: Why would you even define that it is a function? In OOP languages, it might be a class, abstract class or interface, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explanations and details provided can definitely be better, but please have a look at the OTEP template to understand the meaning of each section.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/0000-template.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would it hold a reference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OnChange listeners later in this document. Provide a link and reorder to resolve this conflict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the resource owning SDK or is this something different? In the owning SDK, I would not see why this is required. If you have a use case, please provide one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the explanation to include a description of loose coupling as we practice it in OpenTelemetry. Let me know if it helps!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are such implementation comments really required? Wouldn't this also imply that the contract for ResourceListener should be extended such that it cannot assume to be executed by a specific thread? I.e. it cannot make use of thread local storage etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we want to ensure that we understand the basic details of at least one reasonable implementation, otherwise the SDK maintainers who read the spec will be left with a number of open questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add links; I do not really understand what you want here.
Also, that sounds to me as as if a lot of implementations would break since this cannot be done gracefully in languages lacking proper overload resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are allowed to issue new major versions of the SDK constructors, but it is best if we can avoid it. Implementations lacking overload are architected to use other forms of extensibility to allow for additional config parameters, so that part is not really a problem. But we have identified that the Go SDK may break for other reasons, and we need to be careful.
If you're asking about Providers in general, I would suggest reading the tracing and logging portion of the spec, it will give you a good idea of the patterns that we use in the SDK design. Looking at metrics is also a good idea, but it's a bit more complicated than the other signals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is so fundamental that I would not expect it to be compatible in any way. Hence, I will not even start discussing your use case below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the perspective of the data model and the wire protocol, it is a hard requirement that this change is 100% compatible. That is way any proposal must preserve the Resource section of the data model in a way that allows older systems to continue to work.
When we look at compatibility in Otel, there are three specific areas:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean this will be in a
v2
of the SDK given this is backwards incompatible?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I believe it would, especially if all of the details about provider setup are exposed (which they are in v1). In a world where users have a config file and a NewSDK constructor that encapsulates all of these details, I think it would break far fewer users. But it would still be a major version bump, I don't see how it couldn't be.
In general, it seems like switching to entities would change how resources are detected, and I imagine that alone would probably create a breaking change to SDK setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had been goinng through contortions to try to avoid a version bump.
If we think it's easier to just bump the sdk version - we need to discuss that with sdk maintainers but it would give us a lot of options in API design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Go may be in a special situation here.
Breaking the SDK is not great, it would be better if there is overlap and a grace period rather than a hard break.
For reference, this is the upgrade strategy we use, which depends on users being able to push to the latest version of the SDK. We don't want to create a situation where maintainers have to maintain two branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus target_info
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prometheus target_info makes a hash of the entire resource object? It doesn't have a list of specific identifying attributes that it uses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it uses service name/namespace/instance.id and is, hence, equally not compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow. How would adding a transient attribute such as
session.id
oruser.name
to the resource object affect target_info?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way target_info is currently implemented requires that the set of ressource attributes does not change. If some software relies on that property and you change it, that change breaks that software.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I checked with the Prometheus maintainers about target_info. They say it only uses a small number of specific attributes such as
service.name
, etc. That makes target_info a good example of something that would not be affected by the changes being proposed, as those keys and values would not be perturbed in any way.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not make such statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where these changes break the "immutability" of resources is a question the community continuously asks for clarification on, so it is important that we address it here. But I'm sure that this section could be phrased better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't get anything from immutability if you, in turn, drown the callers in a stream of immutable objects; the result is inconsistency and inefficiency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but that's actually why using the term "immutable" in the spec was never a very accurate description. It only meant "immutable" from the perspective of the local object in code. Not being specific about how we use that word has definitely led to a lot of confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is service.id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops, that should be
service.instance.id
. It is an example of a stable identifying resource. I will come back and do another pass to add links to this and other places where we reference the spec and semantic conventions.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume pending acceptance of this OTEP we'd begin experimenting here?
This seems like the biggest point to address ASAP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this relevant at all? Spans are produced after the end of activity and the snapshot is taken then. Identifying attributes need to remain stable anyway, so the change of descriptive attributes should be know by the backend and the backend can provide adequate information where required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to prototype extensively before accepting this OTEP. Like all OTEPs, it should not be accepted until we can link to working examples that illustrate how we plan to resolve these concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naive implementation could lead to a degenerate case, for example, on mobile, where something like a network connection quickly oscillates between connected and unconnected. This would effectively nerf the batching, cutting a new batch a few times a second (which is big in mobile).
Whether this case needs to be handled well, or simply called out so that implementations can protect itself from it, is something we should address. I don't think a that we need so solve this well at this point for us to move this forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we should learn from this observation is that descriptive attributes cannot be sent with the data but must be sent through another channel instead. IIRC, there already is a proposal for such a channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, reaching this point, I no longer like the network example as it seems to show that it should not have been a resource attribute in the first place. I'd rather use service instances migrating between hardware or threads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SO regarding flagellations + descriptive attributes - I think we need to encourage (but not force) users not to have descriptive attributes in RESOURCE, but allow them in ENTITY. Additionally, we should prefer NOT to identify source of telemetry with something that is volatile. These concerns should be addressed in how we model entity types within Semantic Conventions.
We discussed this a bit in person, but effectively things that MAY be stable in a server-side context MAY NOT be stable in a mobile context (or less stable). So choice of which "entities" to use may need to be localized to an application/service deployment. OTEL needs to be flexible enough to support these both. What we should NOT do is prevent valid identities on Mobile because they don't work in servers or vice-versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but let's be careful to not overcomplicate things with a lot of philosophy! We run into a lot of trouble and bike shedding when we lose track of the practical usage of this data.
Resources are labels applied to all of the data in a batch. Entities are groups of labels. EntityStates are a timeline of changes with precise timestamps as to when the change occurred.
From that perspective, we can ask the question: if a resource such as the networking stack thrashes, or even changes once in the middle of an operation, what label (or labels) is the most helpful to put on that batch of data? And at what granularity should we be segmenting batches of data? The fine grained history is in the EntityState stream. Anything in resources is going to be very coarse grained. So we need to pick a course-graining strategy, based on how we intend backends to make use of the resource labels as a practical manner – not based on a philosophical theory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, perhaps some entities can be marked as "volatile" in a way that causes them to stack up in the same batch rather than trigger a flush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw. is there any prior art for a OTel-like resource concept in observability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, a few things.
From OpenCensus - https://github.com/census-instrumentation/opencensus-specs/tree/master/resource
From Stackdriver (Now Google Cloud Observability) - https://cloud.google.com/monitoring/api/resources
There may be others. I have a write-up (in Entities WG notes document) about lessons learned from Google Cloud's resource design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most proprietary APM products have something like this in them, but they are closed source so I can't point at the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"permanent" is introduced in this example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here, the OnChange function is provided by some third-party class and the SetAttribute is used to propagate changes. Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is locking relevant here? it seems to be the current focus topic; I agree that it is important, but it isn't the most important topic, right? Using the API and migration should be the most important topics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not really understand the pseudo code syntax. It seems to be Go with classes and some strange extra syntax that I do not get. Also, explicit this is really uncommon and only used in languages that require it due to bad language design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make this comment again - The syntax/language for pseudo-code isn't important here.
Are you able to understand what the goal of the interface is from the description and the example? If so, let's evaluate that, not choice of pseudo-code syntax.
If you have specific things you don't understand, list them so they can be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add that regardless of the pseudo-code someone chooses to use, we request that the examples to be 100% explicit and not assume that the reader knows any implicit details about a particular programming language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, plus the statement above is not really correct since the listener could just enqueue a task in a thread pool. I'm not sure why this part should be specified here. It isn't even required in languages/runtimes that do not have threads or commonly don't use them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes when it comes to language-specific examples it's best to look at prototypes, not pseudo-code. The pseudo-code is helpful to show that there is at least one way to implement the proposed changed in a multi-threaded language. It shouldn't be considered the only way or even the best way, but just something to allow us to discuss potential issues with the design that implementations may need to think about.