Add specs for host.id and profiler registration message #853

Merged
merged 5 commits into from
Mar 27, 2024
8 changes: 8 additions & 0 deletions specs/agents/metadata.md
@@ -13,6 +13,7 @@ The process for proposing new metadata fields is detailed
System metadata relates to the host/container in which the service being monitored is running:

- hostname
- host.id
- architecture
- operating system
- container ID
@@ -75,6 +76,13 @@ hostname if `configured_hostname` is not provided.
Agents that are APM-Server-version-aware, or that are compatible only with versions >= 7.4, should
use the new fields wherever applicable.

#### Host.id

APM agents MAY collect the `host.id` as a unique identifier for the host.
If they collect it, it MUST be conformant to the [OpenTelemetry SemConv for `host.id`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/).

If the APM agent performs correlation of its spans/transactions with universal profiling data, it MUST send the `host.id` (see the [profiling integration spec](universal-profiling-integration.md#profiler-registration-message)) as part of the metadata. The APM agent MAY solely rely on the `host.id` provided by the profiling host agent in that case.
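
As an illustration of collecting a `host.id` on Linux, here is a minimal sketch. It assumes that the machine-id files are the source named by the linked semantic conventions for non-containerized Linux hosts; the two paths and both function names (`read_id_file`, `collect_linux_host_id`) are assumptions of this sketch and should be checked against that page.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the first line of `path` into `out`, without the trailing newline.
 * Returns 0 on success, -1 if the file is missing, unreadable or empty. */
int read_id_file(const char *path, char *out, size_t out_size) {
    FILE *f;
    assert(path != NULL && out != NULL && out_size > 0);
    f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    if (fgets(out, (int)out_size, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    out[strcspn(out, "\r\n")] = '\0'; /* cut at the first line break */
    return out[0] != '\0' ? 0 : -1;
}

/* ASSUMPTION: these two paths are the Linux machine-id sources;
 * verify them against the OpenTelemetry semantic conventions. */
int collect_linux_host_id(char *out, size_t out_size) {
    if (read_id_file("/etc/machine-id", out, out_size) == 0) {
        return 0;
    }
    return read_id_file("/var/lib/dbus/machine-id", out, out_size);
}
```

An agent that cannot determine the value this way can still rely on the `host.id` provided by the profiling host agent, as the paragraph above allows.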

#### Container/Kubernetes metadata

On Linux, the container ID and some of the Kubernetes metadata can be extracted by parsing `/proc/self/cgroup`. For each line in the file, we split the line according to the format "hierarchy-ID:controller-list:cgroup-path", extracting the "cgroup-path" part. We then attempt to extract information according to the following algorithm:
25 changes: 20 additions & 5 deletions specs/agents/universal-profiling-integration.md
@@ -83,6 +83,8 @@ transaction-id | uint8[8]
* *span-id*: The W3C span id of the currently active span
* *transaction-id*: The W3C span id of the currently active transaction (=the local root span)

APM-agents MAY start populating the thread-local storage only after receiving a host agent [registration message](#profiler-registration-message).
Member


[nit] Do we have use-cases where the agent should populate this TLS without waiting for the registration message? If so, then a "SHOULD" sounds more appropriate, as updating this TLS seems useless if the profiler is not available.

Contributor Author


The use case would be if you collect traces for the application startup and want to have profiling data for those at the very beginning:
There will be a short delay between application startup and receiving this initial registration message. As a result, spans started and activated before it is received would not get correlated.
This can be overcome for those edge cases by eagerly populating the TLS, even if it is not known whether a profiler is already there.

I'm planning to implement this by having a tri-state enabled config option:

  • false: No correlation, the native lib won't even be loaded
  • true: Correlation active, TLS will be eagerly populated
  • auto (default for OTel-extension): Correlation will be active, TLS will only be populated after receiving the profiler registration message (So basically close to zero overhead if no profiler is active)
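
A minimal sketch of how such a tri-state option could gate TLS population. The enum, the function names, and the fallback to `auto` for unrecognized values are hypothetical and not taken from any agent:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical tri-state setting for profiling correlation. */
typedef enum {
    CORRELATION_DISABLED, /* "false": no correlation, native lib not loaded */
    CORRELATION_ENABLED,  /* "true": TLS is populated eagerly */
    CORRELATION_AUTO      /* "auto": TLS is populated only after registration */
} correlation_mode;

/* Parses the option value. ASSUMPTION: NULL or unknown values mean "auto". */
correlation_mode parse_correlation_mode(const char *value) {
    if (value != NULL && strcmp(value, "false") == 0) {
        return CORRELATION_DISABLED;
    }
    if (value != NULL && strcmp(value, "true") == 0) {
        return CORRELATION_ENABLED;
    }
    return CORRELATION_AUTO;
}

/* Decides whether the agent should write trace data into the TLS. */
bool should_populate_tls(correlation_mode mode, bool registration_received) {
    switch (mode) {
        case CORRELATION_DISABLED: return false;
        case CORRELATION_ENABLED:  return true;
        case CORRELATION_AUTO:     return registration_received;
    }
    assert(!"unreachable for valid correlation_mode values");
    return false;
}
```

With `auto`, the only work done before a profiler announces itself is checking the `registration_received` flag, which matches the near-zero overhead described above.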


### Concurrency-safe Updates

The profiler might interrupt a thread and take a profiling sample while that thread is in the process of updating the contents of the shared thread local storage. Fortunately, we have the following guarantees about this interruption:
@@ -136,10 +138,6 @@ And here how to read a messages in a non-blocking way:

```c
size_t readProfilerSocketMessages(uint8_t* outputBuffer, size_t bufferSize) {
    if(profilerSocket == -1) {
        return raiseExceptionAndReturn(jniEnv, -1, "No profiler socket active!");
    }

    int n = recv(profilerSocket, outputBuffer, bufferSize, 0);
    if (n == -1) {
        if(errno == EAGAIN || errno == EWOULDBLOCK) {
```
@@ -173,6 +171,23 @@ All messages have the following layout:
* *message-type* : An ID uniquely identifying the type (and therefore payload structure) of the message.
* *minor-version* : The version number for the given *message-type*. This value is incremented when new fields are added to the payload while preserving the *message-type* (non-breaking changes). For breaking changes, a new *message-type* must be used.

## Profiler Registration Message

Whenever the profiling host agent starts communicating for the first time with a process running an APM Agent, it MUST send this message.
This message lets the APM-agent know that a profiler is actually active on the current host. Note that an APM-agent may receive this message zero, one or several times: zero times if no host agent is active, once if one is active, and several times if a host agent is restarted during the lifetime of the APM-agent.

The *message-type* is `2` and the current *minor-version* is `1`.

The payload layout is as follows:
Name | Data type
--------------------- | -------------
samples-delay-ms | uint32
host-id | utf8-str

* *samples-delay-ms*: A sane upper bound, in milliseconds, on the time the profiling host agent usually takes between collecting a stacktrace and writing it to the APM-agent via the [messaging socket](#cpu-profiler-trace-correlation-message). The APM-agent will assume that all profiling data related to a span has been written to the socket once the span ended at least this long ago. Note that this value doesn't need to be a hard guarantee, but it should hold in 99% of cases so that profiling data isn't distorted in the expected case.
* *host-id*: The [`host.id` resource attribute](https://opentelemetry.io/docs/specs/semconv/attributes-registry/host/) used for the profiling data by this profiling host agent. If an APM-agent is already sending a `host.id`, it MUST print a warning if the received `host.id` is different and otherwise ignore the value received from the host agent. A mismatch will lead to certain correlation features (e.g. cost and CO2 consumption) not working. If an agent does not collect the `host.id` by itself, it MUST start sending the `host.id` after receiving it from the profiler host agent to ensure the aforementioned correlation features work correctly.
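
The handling described above could be sketched as follows. The wire details are assumptions, since this section does not define them: the sketch treats `utf8-str` as a `uint32` byte length followed by the raw bytes and reads integers in host byte order. `HOST_ID_MAX` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HOST_ID_MAX 256 /* arbitrary limit chosen for this sketch */

typedef struct {
    uint32_t samples_delay_ms;
    char host_id[HOST_ID_MAX]; /* NUL-terminated copy of the host-id field */
} registration_payload;

/* Decodes the payload. Returns 0 on success, -1 if truncated or oversized.
 * ASSUMPTION: utf8-str = uint32 length + bytes, integers in host byte order. */
int parse_registration_payload(const uint8_t *buf, size_t len,
                               registration_payload *out) {
    uint32_t id_len;
    assert(buf != NULL && out != NULL);
    if (len < 2 * sizeof(uint32_t)) {
        return -1;
    }
    memcpy(&out->samples_delay_ms, buf, sizeof(uint32_t));
    memcpy(&id_len, buf + sizeof(uint32_t), sizeof(uint32_t));
    if (id_len >= HOST_ID_MAX || id_len > len - 2 * sizeof(uint32_t)) {
        return -1;
    }
    memcpy(out->host_id, buf + 2 * sizeof(uint32_t), id_len);
    out->host_id[id_len] = '\0';
    return 0;
}

/* Chooses the host.id to send. `own_host_id` is NULL when the agent does not
 * collect one itself. The agent's own value wins; a mismatch is only warned. */
const char *resolve_host_id(const char *own_host_id, const char *received) {
    if (own_host_id == NULL) {
        return received;
    }
    if (strcmp(own_host_id, received) != 0) {
        fprintf(stderr, "WARN: host.id mismatch: agent has '%s', profiler sent '%s'\n",
                own_host_id, received);
    }
    return own_host_id;
}

/* True once a span ended at least samples-delay-ms ago. */
int profiling_data_complete(uint64_t span_end_ms, uint64_t now_ms,
                            uint32_t samples_delay_ms) {
    return now_ms >= span_end_ms && now_ms - span_end_ms >= samples_delay_ms;
}
```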


## CPU Profiler Trace Correlation Message

Whenever the profiler is able to correlate a taken CPU stacktrace sample with an APM trace (see [this section](#thread-local-storage-layout)), it sends the ID of the stacktrace back to the APM agent.
@@ -188,6 +203,6 @@ stack-trace-id | uint8[16]
count | uint16

* *trace-id*: The APM W3C trace id of the trace which was active for the given profiling samples
-* *trace-id*: The APM W3C transaction id of the transaction which was active for the given profiling samples
+* *transaction-id*: The APM W3C transaction id of the transaction which was active for the given profiling samples
Member


Does the W3C spec contain anything about "transactions"? From what I recall it's only about tracestate and traceparent (with a trace-id and parent-id as fields) (ref), but I might definitely have missed something.

Contributor Author


Oops, this should be "The APM W3C span id of the transaction".

* *stack-trace-id*: The unique ID assigned by the profiler to the captured stacktrace. This ID is stored in Elasticsearch in base64 URL-safe encoding by the universal profiling solution.
* *count*: The number of samples observed since the last report for the (*trace-id*, *transaction-id*, *stack-trace-id*) combination.
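
To compare a received *stack-trace-id* with the value stored in Elasticsearch, an agent needs the same base64 URL-safe text form. Below is a minimal sketch of such an encoder using the URL-safe alphabet from RFC 4648. Whether the stored form carries `=` padding is not stated here; this sketch assumes it does not. The function name `base64url_encode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes `len` bytes as unpadded base64url into `out` (NUL-terminated).
 * `out` must hold at least 4 * ((len + 2) / 3) + 1 bytes.
 * Returns the number of characters written, excluding the NUL.
 * ASSUMPTION: no '=' padding is emitted. */
size_t base64url_encode(const uint8_t *in, size_t len, char *out) {
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    /* Full 3-byte groups become 4 characters each. */
    while (i + 3 <= len) {
        uint32_t v = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8) | in[i + 2];
        out[o++] = alphabet[(v >> 18) & 63];
        out[o++] = alphabet[(v >> 12) & 63];
        out[o++] = alphabet[(v >> 6) & 63];
        out[o++] = alphabet[v & 63];
        i += 3;
    }
    /* A trailing 1 or 2 bytes become 2 or 3 characters. */
    if (len - i == 1) {
        uint32_t v = (uint32_t)in[i] << 16;
        out[o++] = alphabet[(v >> 18) & 63];
        out[o++] = alphabet[(v >> 12) & 63];
    } else if (len - i == 2) {
        uint32_t v = ((uint32_t)in[i] << 16) | ((uint32_t)in[i + 1] << 8);
        out[o++] = alphabet[(v >> 18) & 63];
        out[o++] = alphabet[(v >> 12) & 63];
        out[o++] = alphabet[(v >> 6) & 63];
    }
    out[o] = '\0';
    return o;
}
```

A 16-byte *stack-trace-id* encodes to 22 characters under this assumption, so a 23-byte output buffer is sufficient.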