reporter: extract container.ID from cgroupv2 path #548
Conversation
This is a follow-up to #535. While the tracer part of the project reports the cgroup v2 path for each sample, the reporter is expected to report the container ID. The container ID can then be used to associate the sample with more detailed resource information. In the context of the OTel Collector, the container ID can be used like this:

```yaml
k8sattributes:
  auth_type: "serviceAccount"
  passthrough: false
  filter:
    node_from_env_var: KUBERNETES_NODE_NAME
  extract:
    metadata:
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.deployment.name
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.start_time
      - service.namespace
      - service.name
      - service.version
      - service.instance.id
    labels:
      - tag_name: app.label.component
        key: app.kubernetes.io/component
        from: pod
    otel_annotations: true
  pod_association:
    - sources:
        - from: resource_attribute
          name: container.id
```

As the cgroupv2 path contains further information that could be beneficial for other reporters, the extraction of the container ID happens in the reporter.

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
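The extraction the PR describes can be sketched roughly like this. The helper name `extractContainerID` is hypothetical (the PR moved the real function into `reporter/util.go` as a package-private function); the regex mirrors the pattern discussed later in this review:

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern discussed in this review: the container ID is the last
// 64-character hex string in the cgroup v2 path, optionally followed
// by ".scope".
var cgroupv2ContainerIDPattern = regexp.MustCompile(
	`0:.*?:.*?([0-9a-fA-F]{64})(?:\.scope)?$`)

// extractContainerID is a hypothetical helper that pulls the container ID
// out of a /proc/<pid>/cgroup line; it returns "" if no ID is present.
func extractContainerID(cgroupLine string) string {
	m := cgroupv2ContainerIDPattern.FindStringSubmatch(cgroupLine)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	line := "0::/system.slice/docker-" +
		"0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef" +
		".scope"
	fmt.Println(extractContainerID(line))
}
```

For a process outside a container (e.g. `0::/init.scope`), the regex does not match and the helper returns an empty string.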
```go
)

// LookupCgroupv2 returns the cgroupv2 ID for pid.
func LookupCgroupv2(cgrouplru *lru.SyncedLRU[PID, string], pid PID) (string, error) {
```
Moved the functionality of this function to reporter/util.go and made it a package private function.
```diff
 }
 // Set a lifetime to reduce the risk of invalid data in case of PID reuse.
-cgroupv2ID.SetLifetime(90 * time.Second)
+pidToContainerID.SetLifetime(90 * time.Second)
```
Not for this PR, but we can and should do better here (e.g. ProcessManager pub-sub scheme where various subsystems can get notified about a PID exit) since the ProcessManager will receive notifications for every exited PID.
```go
)

var (
	cgroupv2ContainerIDPattern = regexp.MustCompile(`0:.*?:.*?([0-9a-fA-F]{64})(?:\.scope)?$`)
```
We can either relax the regexp as shown below (I'm not sure if `.scope` is always present, so I took it out), or make the matching more fine-grained: first match on the general `/proc/pid/cgroup` format (`0:.*?:(.*)`, as specified here) and then match again on the specific container runtimes that we support.
```diff
-cgroupv2ContainerIDPattern = regexp.MustCompile(`0:.*?:.*?([0-9a-fA-F]{64})(?:\.scope)?$`)
+cgroupv2ContainerIDPattern = regexp.MustCompile(`0:.*?:.*?([0-9a-fA-F]{64})`)
```
Fixed with daa9dc6.
I didn't apply the suggested regex, as it would match the pod ID rather than the container ID. For the container ID we need the last occurrence of the 64-character hex ID; with the suggested regex it is possible to match the first 64-character hex ID, which is the pod ID.
I went with a single regex, as I expected pushback against applying two regexes in sequential order.
Replaced the return statement with a log statement in 7d42302
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
2ced1d4 to daa9dc6
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
```go
if err != nil {
	log.Debugf("Failed to get a cgroupv2 ID as container ID for PID %d: %v",
		meta.PID, err)
	return err
```
Now we're losing traces on lookupContainerID errors (e.g. if a PID dies). If we continue we'd be reporting traces without container ID.
Thinking about this some more, it might be better to get rid of this LRU completely and fetch the container ID when we first parse a PID in updatePIDInformation.
We'd then store the container ID inside ProcessMeta in the process registry ProcessManager.pidToProcessInfo. The wins are two-fold:
- No repeated, wasteful container ID extraction on expiration: a container ID will only be fetched once for any tracked PID.
- We don't throw away traces and there are no race conditions with PID exits.
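A minimal sketch of that alternative. The names here (`ProcessMeta`, `processManager`, `updatePIDInformation`) stand in for the real types in the agent and the lookup is injected as a function so the sketch stays self-contained; the point is that the container ID is resolved once when the PID is first seen and then travels with the process record:

```go
package main

import "fmt"

// ProcessMeta is a hypothetical stand-in for the process registry entry;
// the proposal is to add the container ID as a field here.
type ProcessMeta struct {
	PID         int
	ContainerID string
}

// processManager mimics ProcessManager.pidToProcessInfo: a registry keyed by PID.
type processManager struct {
	pidToProcessInfo map[int]ProcessMeta
	// lookupContainerID is injected for the sketch; in the agent this
	// would read and parse /proc/<pid>/cgroup.
	lookupContainerID func(pid int) string
}

// updatePIDInformation resolves the container ID exactly once per tracked PID.
func (pm *processManager) updatePIDInformation(pid int) ProcessMeta {
	if meta, ok := pm.pidToProcessInfo[pid]; ok {
		return meta // no repeated extraction, no expiry races
	}
	meta := ProcessMeta{PID: pid, ContainerID: pm.lookupContainerID(pid)}
	pm.pidToProcessInfo[pid] = meta
	return meta
}

func main() {
	calls := 0
	pm := &processManager{
		pidToProcessInfo: make(map[int]ProcessMeta),
		lookupContainerID: func(pid int) string {
			calls++
			return "deadbeef"
		},
	}
	pm.updatePIDInformation(42)
	pm.updatePIDInformation(42)
	fmt.Println(calls) // extraction ran only once despite two updates
}
```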
I agree with this earlier comment in #548 that changes to the reporter API to bring it closer to ProcessManager should be a separate change.
My earlier comment would only replace a small part of this PR. With my current comment, I propose replacing this entire PR with something else (we'd keep the containerID extraction logic but nothing else). If you agree that this is the way to go, then it's faster to just do that instead of merging this PR and then creating another PR that will remove most of it.
Also I don't propose any changes to the reporter API, we just need to add an extra field to samples.TraceEventMeta.
I don't think #577 should be the way forward. Profiling capabilities should not be mixed with such functionality.
This PR introduces basic support for container IDs. But there are use cases where the pod ID, cgroup v1 data, or other information is relevant to the reporter.
Therefore, I have the following suggestion:
- merge this PR
- introduce a generic hook in processmanager, so it can be configured to extract any relevant information and report this via TraceEventMeta.
To me processmanager should focus on properly handling processes rather than extracting every possible process data.
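The suggested generic hook could look roughly like this. The interface and names are entirely hypothetical (the discussion left the design open); the idea is that processmanager invokes registered hooks when it first tracks a PID and merges the returned key/value pairs into the per-sample metadata:

```go
package main

import "fmt"

// MetadataHook is a hypothetical extension point: processmanager would call
// registered hooks when it first tracks a PID, and merge the returned
// key/value pairs into the trace event metadata handed to reporters.
type MetadataHook interface {
	OnNewProcess(pid int, cgroupPath string) map[string]string
}

// containerIDHook is an example hook deriving container.id; the actual
// cgroup v2 parsing is stubbed out to keep the sketch self-contained.
type containerIDHook struct{}

func (containerIDHook) OnNewProcess(pid int, cgroupPath string) map[string]string {
	// Real code would apply the cgroup v2 container ID regex here.
	return map[string]string{"container.id": "example-id"}
}

func main() {
	hooks := []MetadataHook{containerIDHook{}}
	meta := make(map[string]string)
	for _, h := range hooks {
		for k, v := range h.OnNewProcess(1234, "0::/kubepods/...") {
			meta[k] = v
		}
	}
	fmt.Println(meta["container.id"])
}
```

A pod ID or cgroup v1 hook would slot into the same list without touching processmanager itself, which is the extensibility argument made above.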
I agree that processmanager should have a more plugin/hook-based system to collect the metadata.
After a brief 15-minute look, I would prefer #577 as a first step, with the following rationale:
- `reporter` indicates it just reports things; this change makes it collect data too
- `reporter` might be too late to collect the container ID in some short-lived process cases, whereas `processmanager` would be more suitable to do it
- `reporter` needs extra LRUs with memory overhead (and CPU for cleaning them?); `processmanager` can stash the information in existing data structures
- `processmanager` already collects and caches process metadata
- the container ID is also a piece of data that other reporters likely want
- every time we have LRUs containing amending data, there is a possibility of going out of sync, where the LRU no longer holds the data we still need (or alternatively it needs to be excessively large and becomes a memory hog)
Additionally, the proposal to add processmanager hooks indicates that the natural place to collect this information is in the context of processmanager.
To me, just collecting the metadata in processmanager makes perfect sense. It would be a later step to convert it to a more plugin-based approach. But even then, processmanager should contain the data in attributes or similar. The reasoning is the same as in #384 for symbol data: the data collection should happen early.
> To me processmanager should focus on properly handling processes rather than extracting every possible process data.
Processmanager is responsible for the process metadata too, and this is such data. #384 proposes to make processmanager also responsible for the symbols. I think processmanager is the central piece for everything. I suppose the fear is that it becomes too monolithic? So, agreeably, it needs to become more plugin-based in the future.
But I also think that the container ID is a pretty commonly wanted thing: it is a fundamental piece of process metadata. I would not try to abstract it away.
So I propose the processmanager role should be extended to include this piece of the process metadata.
I agree with @fabled, I think there's no good reason to go with an LRU now that I've written the code in #577, it's a semantically wrong solution, open to race conditions and less performant (it keeps expiring and extracting container IDs for each PID). In addition, we already extract race-sensitive process metadata in ProcessManager so it's ugly from a consistency POV to split this logic and do that in the reporter too.
Signed-off-by: Florian Lehner <florian.lehner@elastic.co>
Closing in favor of #577