
Add k8s.container.restart_count Resource attribute #1945

Merged

Conversation

dmitryax
Member

This change adds a Resource attribute to represent the number of container restarts. It can be used in k8s log collection to identify a particular container instance, since the number of container restarts is provided as part of the log file path.


Member

@bogdandrutu bogdandrutu left a comment


Please justify why this is not a "metric". It seems to be a metric, not a resource; even the name suggests that it is a "counter" :)

@dmitryax
Member Author

dmitryax commented Sep 21, 2021

@bogdandrutu it's not a metric of a k8s pod or anything like that; it's rather metadata/an identifier of a particular container instance, which I believe is a resource itself. It could also be called container.instance_number or container.run_number, but in the k8s API for ContainerStatus it's referred to as RestartCount, and I think restart_count is more intuitive to understand than the other options.

In k8s logs collection, we want to use the following set of attributes to identify a particular container: (k8s.pod.uid, container.name, container.restart_count), which we can use to fetch other container attributes like container.id, etc.
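To make the identification concrete, here is a minimal sketch (not part of the PR; the helper name is an illustrative assumption) of deriving these attributes from a kubelet pod log file path, assuming the common layout `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log`:

```python
import re

# Assumed kubelet pod log path layout (illustrative):
# /var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log
LOG_PATH_RE = re.compile(
    r"^/var/log/pods/"
    r"(?P<namespace>[^_/]+)_(?P<pod_name>[^_/]+)_(?P<pod_uid>[^/]+)/"
    r"(?P<container_name>[^/]+)/"
    r"(?P<restart_count>\d+)\.log$"
)

def resource_attributes_from_log_path(path: str) -> dict:
    """Derive identifying resource attributes from a pod log file path."""
    m = LOG_PATH_RE.match(path)
    if m is None:
        raise ValueError(f"unrecognized pod log path: {path}")
    return {
        "k8s.namespace.name": m.group("namespace"),
        "k8s.pod.name": m.group("pod_name"),
        "k8s.pod.uid": m.group("pod_uid"),
        "container.name": m.group("container_name"),
        "container.restart_count": int(m.group("restart_count")),
    }
```

With these attributes extracted, a processor can look up the remaining container metadata (e.g. container.id) from the k8s API.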

@tigrannajaryan
Member

I think container.restart_count may work better as a span/log record attribute. It looks a bit strange as a Resource attribute. Or maybe we should give a different name so that it is clear that it is a reasonable non-identifying property of the Resource.

I agree with @bogdandrutu that it can also be a metric, but right now the source of this data is a log file, so to capture it as a metric we are essentially talking about signal conversion, which is a valid approach, but it does not prohibit us from logging this data as a log record attribute.

@dmitryax
Member Author

@tigrannajaryan thanks for replying.

I think container.restart_count may work better as a span/log record attribute.

If a container is a source of logs/traces/metrics, it does seem to be a resource. Why do you think a span/log record attribute would be a better place? The meaning of this attribute is very similar to container.id, which is a resource attribute. The main difference is that container.id is a unique attribute, while this one is unique only within a particular container spec (e.g. pod->containerName in k8s).

Or maybe we should give a different name so that it is clear that it is a reasonable non-identifying property of the Resource.

We could call it container.instance_number, container.run_number or container.run_attempt, but those names are probably not very descriptive. Please let me know if you have any other suggestions.

@tigrannajaryan
Member

We don't have this formally defined anywhere in Otel but it is clear that some Resource attributes identify the Resource and some others are non-identifying. This is obviously a non-identifying attribute of a Resource (unless it is somehow possible for different runs to co-exist, which I think is not possible since this is about restarts).

I think the name is what causes a bit of a confusion and friction here. However if we are certain that this attribute indeed is associated with the Resource (Container instance) and does not change during the lifetime of the Resource then it needs to be a Resource attribute. I think it is true in this case so we should accept the convention.

@tigrannajaryan
Member

To be clear: I can't think of a better name.

@dmitryax
Member Author

dmitryax commented Sep 22, 2021

This is obviously a non-identifying attribute of a Resource (unless it is somehow possible for different runs to co-exist, which I think is not possible since this is about restarts).

I don't think this is only about restarts. I took the name from the k8s API, where it is referred to as RestartCount, but it actually seems to be an identifying resource attribute for a particular container instance. From a container orchestrator perspective, there can be other containers with a different container.id and a different container.restart_count from the same container spec, but they all must be in a stopped/terminated state, and only the one with the highest container.restart_count number can be in a running state at any point in time.

I think the name is what causes a bit of a confusion and friction here. However if we are certain that this attribute indeed is associated with the Resource (Container instance) and does not change during the lifetime of the Resource then it needs to be a Resource attribute.

Yes, I believe this attribute does not change during the lifetime of a container as a Resource. Once a container gets restarted (container.restart_count incremented), it's already another container with a different container.id. I think the name is still debatable; it could be something more like an identifier than a counter, but I believe it should be a resource attribute.
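The invariant described above (only the instance with the highest restart count can be running) can be sketched as follows. This is an illustration with a hypothetical status shape, not the actual k8s API types:

```python
def current_instance(statuses):
    """Among container instances created from one spec (pod + container name),
    pick the only one that can be running: the one with the highest
    restart_count. All others must be stopped/terminated.
    (The dict shape here is hypothetical, for illustration only.)"""
    return max(statuses, key=lambda s: s["restart_count"])

# Three instances created from the same container spec over time:
statuses = [
    {"container.id": "aaa", "restart_count": 0, "state": "terminated"},
    {"container.id": "bbb", "restart_count": 1, "state": "terminated"},
    {"container.id": "ccc", "restart_count": 2, "state": "running"},
]
```

Because at most one instance per spec is running at any time, (pod, container name, restart_count) uniquely identifies an instance even though restart_count alone does not.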

@tigrannajaryan
Member

From a container orchestrator perspective, there can be other containers with a different container.id and a different container.restart_count from the same container spec, but they all must be in a stopped/terminated state, and only the one with the highest container.restart_count number can be in a running state at any point in time.

I see, this is important. In that case I think a name like container.instance_number or container.run_number may be more appropriate.

If you could find other (non-k8s) container orchestration systems which exhibit similar behavior and see what they call this concept, it may be easier to arrive at the right name that is more universal (since the container.* namespace is not limited to k8s).

@Oberon00
Member

Oberon00 commented Sep 23, 2021

This is obviously a non-identifying attribute of a Resource

I disagree. I think the restart ID, or whatever we want to call it, is even required if you want to retrieve the logs of a particular container restart cycle (e.g. to find out why it restarted).

(unless it is somehow possible for different runs to co-exist, which I think is not possible since this is about restarts).

I don't think this has much to do with it. A timestamp could in theory be used together with a POD ID + container name as an alternative to the restart ID. But how does this make the restart ID non-identifying? Is the timestamp an identifying attribute? We don't even have a "process start time" or "container start time" attribute on the resource (though that would maybe make sense). Also, when it comes to timestamps very close to restarts, it might depend on the clock from which they came to which restart cycle they would map due to inexactness.

@Oberon00 Oberon00 added area:semantic-conventions Related to semantic conventions spec:resource Related to the specification/resource directory labels Sep 23, 2021
@tigrannajaryan
Member

I disagreee. I think the restart ID or whatever we want to call it is even required if you want to retrieve the logs of a particular container restart cycle (e.g. to find out why it restarted).

I did not say it is unnecessary data. :-) I said it is not required to uniquely identify the Container Instance (and perhaps I am wrong). Since there can only ever be one instance of that container, the restart counter is superfluous for the purposes of identifying which container instance it was. It is of course not superfluous, as you correctly point out, for the purposes of understanding how many times the instance restarted and which "numbered start" of the instance the particular log belongs to.

So, I guess it depends on how we define the Container Instance entity. Is it the same instance if it restarts, or a different instance? Depending on that, the restart counter will be either a non-identifying or an identifying attribute of the instance. We can debate about this (I am not sure what's the right approach), but I think for the purposes of this particular PR it does not matter, since we agree that one way or another the "restart counter" is an attribute of the Resource.

@dmitryax
Member Author

dmitryax commented Sep 27, 2021

I didn't find any insights on a name for this attribute looking at other container orchestrators.

Some other options that come to my mind are container.start_number, container.start_index, container.run_index, container.index

@tigrannajaryan
Member

@bogdandrutu are you OK with the explanation in this thread of why this is not a metric?

dmitryax added a commit to dmitryax/opentelemetry-collector-contrib that referenced this pull request Sep 28, 2021
This change provides an option to fetch container metadata from the k8s API in addition to k8s pod metadata. The following attributes can now be automatically added by the k8sattributes processor:
- container.image.name
- container.image.tag
- container.id

`container.image.name` and `container.image.tag` require an additional container identifier to be present in resource attributes: `container.name`.

`container.id` requires additional container run identifiers to be present in resource attributes: `container.name` and `run_id`.

The `run_id` identifier is subject to change, see open-telemetry/opentelemetry-specification#1945
@bogdandrutu bogdandrutu dismissed their stale review September 28, 2021 15:21

Please add an explanation of why the restart_count needs to identify the "running" instance of a container.

bogdandrutu pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Sep 28, 2021
* [k8sattributes processor] Add optional container metadata

This change provides an option to fetch container metadata from k8s API in addition to k8s pod metadata. The following attributes now can be automatically added by the k8sattributes processor:
- container.image.name
- container.image.tag
- container.id

`container.image.name` and `container.image.tag` require additional container identifier present in resource attributes: `container.name`.

`container.id` requires additional container run identifiers present in resource attributes: `container.name` and `run_id`.

`run_id` identified is a subject to change, see open-telemetry/opentelemetry-specification#1945

* Make linter happy

* Make container attributes enabled by default
@dmitryax
Member Author

@bogdandrutu I added an additional description to the attribute. Please let me know if it works, or if you want something other than that.

@dmitryax
Member Author

dmitryax commented Sep 29, 2021

@tigrannajaryan @Oberon00 what do you think about the name?

Maybe something like container.start_number or container.run_number would better represent it as an identifier?

Or we can keep container.restart_count?

@Oberon00
Member

I think restart_count works, and is the only existing name for the concept I could find.

I just wonder: if this is actually a k8s-specific concept, should the attribute be moved to the k8s semantic conventions, e.g. k8s.container.restart_count? (The k8s.container group already exists in the semantic conventions, though I have no idea what the difference between container.name and k8s.container.name would be.)

@bogdandrutu
Member

@Oberon00 Regarding container.name and k8s.container.name: I remember that k8s has a way to configure a different name for the container (I think it is more or less an alias), but I may be wrong. @dmitryax I think you know about that; it may be worth clarifying it as well.

In regard to restart_count, is it available in Docker? I found moby/moby#25859, which seems to suggest that Docker also has a concept of "restarts".

@tigrannajaryan
Member

+1 for restart_count, and for double-checking whether it is k8s-specific or not.

@dmitryax
Member Author

@bogdandrutu thanks for posting the link.

Looks like the docker engine uses the same terminology, but the meaning is a bit different: RestartCount is a non-identifying attribute of the container in the docker engine; for example, ContainerID stays the same between restarts.

On the other hand, when it's restarted by k8s, the container is different; it gets another ContainerID.

Given that, I think we should use the k8s.container.restart_count name for this particular attribute to avoid confusion with the RestartCount used by the docker engine.

@Oberon00 @tigrannajaryan @bogdandrutu I'm going to move it to the k8s namespace if you agree.

@tigrannajaryan
Member

Looks like the docker engine uses the same terminology, but the meaning is a bit different: RestartCount is a non-identifying attribute of the container in the docker engine; for example, ContainerID stays the same between restarts.

On the other hand, when it's restarted by k8s, the container is different; it gets another ContainerID.

@dmitryax Is this a significant difference that warrants placing the counter in the k8s namespace specifically? If in the future we decide that we also want a convention for Docker's restart count, will we introduce docker.container.restart_count, and if we do, what is the benefit of it being different from k8s.container.restart_count?

Do we expect that docker.container.restart_count and k8s.container.restart_count may contain different values (e.g. one is a decimal number and the other is a hex number), which makes it impossible to put them in one attribute named container.restart_count?

If there are such differences, then yes, let's make it a separate attribute for k8s.

However, if the only difference is the behavior of whether the container id changes or not when restart_count is incremented, then I am not sure why it matters for observability purposes. Perhaps it matters but I am failing to see it; please clarify.

@dmitryax
Member Author

dmitryax commented Oct 1, 2021

@dmitryax Is this a significant difference that warrants placing the counter in the k8s namespace specifically? If in the future we decide that we also want a convention for Docker's restart count, will we introduce docker.container.restart_count, and if we do, what is the benefit of it being different from k8s.container.restart_count?

I don't think the difference is significant, but they do have different meanings. My concern is that container.restart_count is more likely to be associated with docker, as a system that is much closer to containers, than with k8s. But I'm not sure that is a big enough problem to prevent us from using container.restart_count for both concepts.

Do we expect that docker.container.restart_count and k8s.container.restart_count may contain different values (e.g. one is a decimal number and the other is a hex number), which makes it impossible to put them in one attribute named container.restart_count?

No, they are both incremental decimal numbers.

However, if the only difference is the behavior of whether the container id changes or not when restart_count is incremented, then I am not sure why it matters for observability purposes. Perhaps it matters but I am failing to see it; please clarify.

Yes, this is the only difference that I observed: k8s restarts containers by replacing them, while docker reuses existing containers. And I don't think there is a use case where both approaches are applicable at the same time, as opposed to the container.name/k8s.container.name case. So we can probably reuse container.restart_count for both approaches as well. I would need to update the description in that case.
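The difference described above can be sketched as follows (illustrative values only, not real container IDs): a docker engine restart keeps the same container.id while RestartCount increments, whereas a k8s restart replaces the container, so container.id changes along with the count.

```python
# Docker engine: the same container is restarted in place,
# so container.id is stable across restarts.
docker_before = {"container.id": "abc123", "restart_count": 0}
docker_after = {"container.id": "abc123", "restart_count": 1}
assert docker_before["container.id"] == docker_after["container.id"]

# k8s: the restarted container is a new one with a new container.id.
k8s_before = {"container.id": "abc123", "restart_count": 0}
k8s_after = {"container.id": "def456", "restart_count": 1}
assert k8s_before["container.id"] != k8s_after["container.id"]
```

This is why, in k8s, the restart count can act as an identifier of a container instance, while in docker it is a plain counter on a single long-lived container.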

@Oberon00
Member

Oberon00 commented Oct 1, 2021

I am leaning towards having separate attributes. From a theoretical (!) standpoint, if I want to get the logs of a container at a certain restart cycle I would have to know either: container.name + container.restart_count (since different k8s-level container restarts have different container.names) or k8s.pod.name + k8s.container.name + k8s.container.restart_count + container.restart_count (for the theoretical case that a container is restarted by the runtime within a pod).

OK, k8s does not use any container runtime-provided restart mechanism today. But what if we have another orchestrator that does? Or k8s starts using it?

Even now the semantics seem to be:

container.restart_count: the number of previous restarts for this container as identified by container.name; except if using k8s, then the number of restarts of the container as identified by k8s.pod.name + k8s.container.name.

I don't like "except if" semantics much, if we can avoid them.

@tigrannajaryan
Member

It seems slightly safer and more future-proof to keep the attribute specific to k8s, since we know its exact semantics, while we are not sure (we can only speculate) whether other orchestration runtimes will have a similar attribute.
Let's go with k8s.container.restart_count.

This change adds a Resource attribute to represent the number of container restarts in kubernetes. It can be used in k8s log collection to identify a particular container instance, since the number of container restarts is a part of the log file path.
@dmitryax dmitryax force-pushed the add-container-run-number branch from 8f384e7 to 36f44c9 Compare October 1, 2021 16:18
@dmitryax dmitryax changed the title Add container.restart_count Resource attribute Add k8s.container.restart_count Resource attribute Oct 1, 2021
@dmitryax
Member Author

dmitryax commented Oct 1, 2021

Moved to k8s namespace

dmitryax added a commit to dmitryax/opentelemetry-collector-contrib that referenced this pull request Oct 3, 2021
Change the name of the attribute according to the agreement in open-telemetry/opentelemetry-specification#1945. It's better if the actual change can be merged before the next release to avoid breaking changes in the k8sattributes processor.
@tigrannajaryan tigrannajaryan requested a review from a team October 3, 2021 07:49
semantic_conventions/resource/k8s.yaml (review comments resolved)
@dashpole
Contributor

dashpole commented Oct 5, 2021

container.name and k8s.container.name are the same for all container runtimes except docker when using kubernetes. The name difference comes from the "dockershim", which makes docker implement the CRI. Dockershim is also deprecated, and will be removed from upstream in the next 6 months. So the name difference isn't really k8s vs others, but docker vs others, and will become less common.

But assuming we are using separate names for container runtime name vs kubernetes name, I think it is correct to separate the restart count as well, since it is the number of times the container has restarted for each name.

@tigrannajaryan
Member

I think we have enough approvals, but let's keep this open for another day before merging.

@tigrannajaryan tigrannajaryan enabled auto-merge (squash) October 6, 2021 12:08
tigrannajaryan pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Oct 6, 2021
…t` (#5572)

Change the name of the attribute according to the agreement in open-telemetry/opentelemetry-specification#1945. It's better if the actual change can be merged before the next release to avoid breaking changes in the k8sattributes processor.
@tigrannajaryan tigrannajaryan merged commit 8ebafd7 into open-telemetry:main Oct 6, 2021
@dmitryax dmitryax deleted the add-container-run-number branch October 21, 2021 00:50
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024
…1945)

This change adds a Resource attribute to represent the number of container restarts in kubernetes. It can be used in k8s log collection to identify a particular container instance, since the number of container restarts is a part of the log file path.

Co-authored-by: Tigran Najaryan <[email protected]>